knitr::include_graphics("2018-mercedes-a-class-hatchback-with-night-package.jpg")
Cinli bir otomobil sirketi Geely Auto, orada uretim birimlerini kurarak ve ABD ve Avrupali meslektaslarina rekabet edebilmek icin yerel olarak otomobil ureterek ABD pazarina girmeyi hedefliyor. Otomobillerin fiyatlandirmasinin hangi faktorlere bagli oldugunu anlamak icin bir otomobil danismanlik sirketi ile anlastilar.Ozellikle , Amerikan pazarinda araba fiyatlandirmasini etkileyen faktorleri anlamak istiyorlar cunku bunlar Cin pazarindan cok farkli olabilir.Sirket bilmek istiyor: Bir arabanin fiyatini tahmin etmede hangi degiskenler onemlidir.Bu degiskenler bir arabanin fiyatini ne kadar iyi tanimlamaktadir.Danismanlik sirketi,cesitli pazar arastirmalarina dayanarak, Amerika pazarinda farkli tipte araclardan olusan buyuk bir veri seti topladi.
Veri setimiz 26 degiskenli ve 205 gozlemlidir.
Mevcut bagimsiz degiskenlere sahip araclarin fiyatini modellememiz gerekmektedir.Yonetim tarafindan fiyatlarin bagimsiz degiskenlerle tam olarak nasil degistigini anlamak icin kullanilacaktir.Bu nedenle,belirli fiyat seviyelerini karsilamak icin arabalarin tasarimini, is stratejisini vb. Manipule edebilirler.Ayrica,model yonetimin yeni bir pazarin fiyatlandirma dinamiklerini anlamasi icin iyi bir yol olacaktir.
Degiskenler:
car_ID:Araba Numarasi
symboling:Simgesel
CarName:Araba modeli
fueltype:Yakit tipi(gaz/dizel)
aspiration:Havalandirma
doornumber:kapi sayisi
carbody:Arac govdesi
drivewheel:Kuvvet ileten ve torku lastiklerden yola cekis kuvvetine donusturerek aracin hareket etmesine neden olan bir motorlu tasitin tekerlegidir.
enginelocation:Motor yeri
wheelbase:Tekerlek acikligi
carlength:Araba boyu
carwidth:Araba genisligi
carheight:Araba yuksekligi
curbweight:Firen agirligi
enginetype:Motor tipi
cylindernumber:Silindir numarasi
enginesize:Motor genisligi
fuelsystem:Yakit sistemi
boreratio:Pistonlu bir piston motorunda,delik/strok orani veya strok/delik orani ile tanimlanan strok orani, silindir capi ve piston strok uzunlugu arasindaki orani aciklayan bir terimdir.
stroke:Inme (pistonun her iki yonde silindir boyunca tam hareketini ifade eder)
compressionratio:Sikistirma orani
horsepower:Beygir gucu
peakrpm:En yuksek devir
citympg:Sehir mpg
highwaympg:Kara yolu mpg
price:Fiyat
library(readr)
veri <- read.csv("C:/Users/Melike/Desktop/CarPrice_Assignment.csv",header = T)
head(veri,20)
## car_ID symboling CarName fueltype aspiration doornumber
## 1 1 3 alfa-romero giulia gas std two
## 2 2 3 alfa-romero stelvio gas std two
## 3 3 1 alfa-romero Quadrifoglio gas std two
## 4 4 2 audi 100 ls gas std four
## 5 5 2 audi 100ls gas std four
## 6 6 2 audi fox gas std two
## 7 7 1 audi 100ls gas std four
## 8 8 1 audi 5000 gas std four
## 9 9 1 audi 4000 gas turbo four
## 10 10 0 audi 5000s (diesel) gas turbo two
## 11 11 2 bmw 320i gas std two
## 12 12 0 bmw 320i gas std four
## 13 13 0 bmw x1 gas std two
## 14 14 0 bmw x3 gas std four
## 15 15 1 bmw z4 gas std four
## 16 16 0 bmw x4 gas std four
## 17 17 0 bmw x5 gas std two
## 18 18 0 bmw x3 gas std four
## 19 19 2 chevrolet impala gas std two
## 20 20 1 chevrolet monte carlo gas std two
## carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 1 convertible rwd front 88.6 168.8 64.1 48.8
## 2 convertible rwd front 88.6 168.8 64.1 48.8
## 3 hatchback rwd front 94.5 171.2 65.5 52.4
## 4 sedan fwd front 99.8 176.6 66.2 54.3
## 5 sedan 4wd front 99.4 176.6 66.4 54.3
## 6 sedan fwd front 99.8 177.3 66.3 53.1
## 7 sedan fwd front 105.8 192.7 71.4 55.7
## 8 wagon fwd front 105.8 192.7 71.4 55.7
## 9 sedan fwd front 105.8 192.7 71.4 55.9
## 10 hatchback 4wd front 99.5 178.2 67.9 52.0
## 11 sedan rwd front 101.2 176.8 64.8 54.3
## 12 sedan rwd front 101.2 176.8 64.8 54.3
## 13 sedan rwd front 101.2 176.8 64.8 54.3
## 14 sedan rwd front 101.2 176.8 64.8 54.3
## 15 sedan rwd front 103.5 189.0 66.9 55.7
## 16 sedan rwd front 103.5 189.0 66.9 55.7
## 17 sedan rwd front 103.5 193.8 67.9 53.7
## 18 sedan rwd front 110.0 197.0 70.9 56.3
## 19 hatchback fwd front 88.4 141.1 60.3 53.2
## 20 hatchback fwd front 94.5 155.9 63.6 52.0
## curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 1 2548 dohc four 130 mpfi 3.47 2.68
## 2 2548 dohc four 130 mpfi 3.47 2.68
## 3 2823 ohcv six 152 mpfi 2.68 3.47
## 4 2337 ohc four 109 mpfi 3.19 3.40
## 5 2824 ohc five 136 mpfi 3.19 3.40
## 6 2507 ohc five 136 mpfi 3.19 3.40
## 7 2844 ohc five 136 mpfi 3.19 3.40
## 8 2954 ohc five 136 mpfi 3.19 3.40
## 9 3086 ohc five 131 mpfi 3.13 3.40
## 10 3053 ohc five 131 mpfi 3.13 3.40
## 11 2395 ohc four 108 mpfi 3.50 2.80
## 12 2395 ohc four 108 mpfi 3.50 2.80
## 13 2710 ohc six 164 mpfi 3.31 3.19
## 14 2765 ohc six 164 mpfi 3.31 3.19
## 15 3055 ohc six 164 mpfi 3.31 3.19
## 16 3230 ohc six 209 mpfi 3.62 3.39
## 17 3380 ohc six 209 mpfi 3.62 3.39
## 18 3505 ohc six 209 mpfi 3.62 3.39
## 19 1488 l three 61 2bbl 2.91 3.03
## 20 1874 ohc four 90 2bbl 3.03 3.11
## compressionratio horsepower peakrpm citympg highwaympg price
## 1 9.0 111 5000 21 27 13495.00
## 2 9.0 111 5000 21 27 16500.00
## 3 9.0 154 5000 19 26 16500.00
## 4 10.0 102 5500 24 30 13950.00
## 5 8.0 115 5500 18 22 17450.00
## 6 8.5 110 5500 19 25 15250.00
## 7 8.5 110 5500 19 25 17710.00
## 8 8.5 110 5500 19 25 18920.00
## 9 8.3 140 5500 17 20 23875.00
## 10 7.0 160 5500 16 22 17859.17
## 11 8.8 101 5800 23 29 16430.00
## 12 8.8 101 5800 23 29 16925.00
## 13 9.0 121 4250 21 28 20970.00
## 14 9.0 121 4250 21 28 21105.00
## 15 9.0 121 4250 20 25 24565.00
## 16 8.0 182 5400 16 22 30760.00
## 17 8.0 182 5400 16 22 41315.00
## 18 8.0 182 5400 15 20 36880.00
## 19 9.5 48 5100 47 53 5151.00
## 20 9.6 70 5400 38 43 6295.00
summary(veri)
## car_ID symboling CarName fueltype
## Min. : 1 Min. :-2.0000 Length:205 Length:205
## 1st Qu.: 52 1st Qu.: 0.0000 Class :character Class :character
## Median :103 Median : 1.0000 Mode :character Mode :character
## Mean :103 Mean : 0.8341
## 3rd Qu.:154 3rd Qu.: 2.0000
## Max. :205 Max. : 3.0000
## aspiration doornumber carbody drivewheel
## Length:205 Length:205 Length:205 Length:205
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## enginelocation wheelbase carlength carwidth
## Length:205 Min. : 86.60 Min. :141.1 Min. :60.30
## Class :character 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10
## Mode :character Median : 97.00 Median :173.2 Median :65.50
## Mean : 98.76 Mean :174.0 Mean :65.91
## 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90
## Max. :120.90 Max. :208.1 Max. :72.30
## carheight curbweight enginetype cylindernumber
## Min. :47.80 Min. :1488 Length:205 Length:205
## 1st Qu.:52.00 1st Qu.:2145 Class :character Class :character
## Median :54.10 Median :2414 Mode :character Mode :character
## Mean :53.72 Mean :2556
## 3rd Qu.:55.50 3rd Qu.:2935
## Max. :59.80 Max. :4066
## enginesize fuelsystem boreratio stroke
## Min. : 61.0 Length:205 Min. :2.54 Min. :2.070
## 1st Qu.: 97.0 Class :character 1st Qu.:3.15 1st Qu.:3.110
## Median :120.0 Mode :character Median :3.31 Median :3.290
## Mean :126.9 Mean :3.33 Mean :3.255
## 3rd Qu.:141.0 3rd Qu.:3.58 3rd Qu.:3.410
## Max. :326.0 Max. :3.94 Max. :4.170
## compressionratio horsepower peakrpm citympg
## Min. : 7.00 Min. : 48.0 Min. :4150 Min. :13.00
## 1st Qu.: 8.60 1st Qu.: 70.0 1st Qu.:4800 1st Qu.:19.00
## Median : 9.00 Median : 95.0 Median :5200 Median :24.00
## Mean :10.14 Mean :104.1 Mean :5125 Mean :25.22
## 3rd Qu.: 9.40 3rd Qu.:116.0 3rd Qu.:5500 3rd Qu.:30.00
## Max. :23.00 Max. :288.0 Max. :6600 Max. :49.00
## highwaympg price
## Min. :16.00 Min. : 5118
## 1st Qu.:25.00 1st Qu.: 7788
## Median :30.00 Median :10295
## Mean :30.75 Mean :13277
## 3rd Qu.:34.00 3rd Qu.:16503
## Max. :54.00 Max. :45400
veri$car_ID <-factor(veri$car_ID)
Verimizdeki car_ID ’ i yani araba numarasini faktor olarak atadik.
Verimizi oncelikle test ve train olarak ayiralim.
set.seed(2)
index<-sample(1:nrow(veri),round(nrow(veri)*0.85))
veritrain<-veri[index,]
veritest<-veri[-index,]
Regresyon modelimizi kuralim
Verimizdeki yanit degiskeni price e karsilik aciklayici degiskenleri (wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg) kullanarak regresyon modeli kurulur.
lmod<- lm(price~wheelbase+carlength+carwidth+carheight+curbweight+enginesize+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg,data=veritrain)
summary(lmod)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11683.9 -1816.6 -17.8 1412.3 13859.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.047e+04 1.686e+04 -2.401 0.017499 *
## wheelbase 1.302e+02 1.127e+02 1.155 0.249710
## carlength -7.222e+01 6.265e+01 -1.153 0.250699
## carwidth 4.112e+02 2.706e+02 1.519 0.130684
## carheight 2.048e+02 1.477e+02 1.387 0.167467
## curbweight 8.278e-01 1.907e+00 0.434 0.664829
## enginesize 1.255e+02 1.586e+01 7.915 3.88e-13 ***
## boreratio -1.585e+03 1.424e+03 -1.113 0.267413
## stroke -3.455e+03 8.880e+02 -3.891 0.000146 ***
## compressionratio 3.848e+02 9.918e+01 3.880 0.000152 ***
## horsepower 2.715e+01 1.797e+01 1.511 0.132730
## peakrpm 2.276e+00 7.063e-01 3.222 0.001544 **
## citympg -3.745e+02 2.041e+02 -1.835 0.068419 .
## highwaympg 1.648e+02 1.801e+02 0.915 0.361357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3206 on 160 degrees of freedom
## Multiple R-squared: 0.8576, Adjusted R-squared: 0.846
## F-statistic: 74.1 on 13 and 160 DF, p-value: < 2.2e-16
Kurulan regresyon modelinin anlamliligina baktigimizda p value yaklasik 0’dir.Yani 0.05’ten kucuk oldugundan kurulan model anlamlidir.
Sabit varyansliligin en kullanisli teshis yontemi artiklara (residuals(lmod)) karsilik tahmin (fitted(lmod)) degerlerinin plotlanmasidir.
op = par(bg = "seashell2")
plot(fitted(lmod),residuals(lmod), xlab = "fitted y" ,ylab = "residuals",col="purple",main="Artiklar-Tahmin Grafigi")
abline(h=0,col="yellow")
#Interaktif bir sekilde gorsellestirelim ;
op = par(bg = "seashell2")
library(plotly)
p <- plot_ly(veritrain, x = fitted(lmod), y = residuals(lmod), alpha = 0.3)
subplot(
add_markers(p, size = 2, name = "default"),
add_markers(p, size = 2, sizes = c(1, 205), name = "custom")
)
#Interaktif plotun gorseli;(Html ciktisini pdf e cevirince gozukmedigi icin ekledik)
knitr::include_graphics("varsayimkontrolu1.png")
Cizdirgimiz grafikte sifir etrafinda nasil dagildigini gormek icin h=0 ile yataya cizgi ekledik.Grafigimiz bize duzgun bir sekil vermedigi icin sabit varyansli mi diye emin olamiyoruz.Bunun icin degisken varyanslilik testlerine bakmaliyiz.
H0:Heterosce Dosticity (Degisken Varyanslilik) problemi yok. H1:Heterosce Dosticity (Degisken Varyanslilik) problemi vardir.
#install.packages("lmtest")
library(lmtest)
bptest(lmod,data=veritrain)
##
## studentized Breusch-Pagan test
##
## data: lmod
## BP = 71.718, df = 13, p-value = 3.869e-10
Breusch pagan testimizin sonucuna gore p-value degerimiz (3.869e-10) yaklasik 0 cikmistir.P-value degerimiz 0.05’ten kucuk oldugundan H0 hipotezi reddedilir yani heterocedosticity (degisken varyanslilik) problemi vardir deriz.
wmod<-lm(residuals(lmod)^2~fitted(lmod)+fitted(lmod)^2,veritrain)
summary(wmod)
##
## Call:
## lm(formula = residuals(lmod)^2 ~ fitted(lmod) + fitted(lmod)^2,
## data = veritrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36064886 -5294096 -829388 2258929 160785358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.094e+07 2.938e+06 -3.725 0.000265 ***
## fitted(lmod) 1.539e+03 1.927e+02 7.986 1.91e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19170000 on 172 degrees of freedom
## Multiple R-squared: 0.2705, Adjusted R-squared: 0.2663
## F-statistic: 63.78 on 1 and 172 DF, p-value: 1.911e-13
White testimizin sonucuna gore p-value degerimiz (1.911e-13) yaklasik 0 cikmistir.P-value degerimiz 0.05’ten kucuk oldugundan H0 hipotezi reddedilir yani heterocedosticity (degisken varyanslilik) problemi vardir deriz.
Varyanslarin homojenligi saglanmamasi durumunda “Yanit degiskeni uzerinde donusum yapmak” veya “Agirlikli En Kucuk Kareler” yontemlerine basvururuz.
Siradan en kucuk kareler yontemi, hata varyanslarinin sabit oldugunu varsayar(homoscedasticity). Agirlikli en kucuk kareler yontemi bu varsayim saglanmadigi durumlarda kullanilir.
Varyansi buyuk olan degiskenin model uzerinde etkisi fazla olur. EKK nin en iyi calisabilmesi icin hata varyansi σ2i i=1,2,….n birbirine esit olmasi gerekir. Eger σ2i ler esit degil ise agirlikli en kucuk kareler yontemine gecmeliyiz.
Her bir σ2i varyansina karsilik wi= 1/σ2i agirligini tanimlariz.(Agirliklari 1/σ2i almamizin sebebi hepsinin varsayansini 1 e ve birbirlerine esitlemeye calismamizdir. σ2i*(1/σ2i))
Bu sekilde agirligi buyuk olanin agirligini alip, agirligi kucuk olana agirlik yukluyoruz. Buradaki zorluk σ2i parametresinin bilinmemesinden kaynaklanir ve w matrisi kolay belirlenemez.
Agirliklari belirlemek icin ilk olarak EKK regresyon modeli kurulur artiklar elde edilir.Agirliklar belirlendikten sonra agirliklandirilmis EKK kullanilir.Regresyon modeli uzerinden artiklar hesaplanip agirlik incelemesi yapilir. Gercekten kullanilan agirlikla varyans homojen hale gelmis mi diye bakilir.Eger varyans homojenligi saglanmamissa tekrar agirliklandirma yapilir buna iteratif agirliklandirma denir.
Bazi olasi varyans ve standart sapma fonksiyonu tahminleri:
Aciklayici degiskenlere karsilik artiklarin grafigini cizdirip megafon sekli varmi diye bakilir.Eger megafon sekli var ise aciklayici degiskenler ile mutlak artiklarin arasinda regresyon modeli kurulur.Bu regresyon modeli uzerinden tahmin degeri elde edilir.Bu elde elilen tahmin degerlerini σi yerine kullaniriz.Agirliklar da 1/σ2i diye olusturulur.
Ilk basta kurulan EKK modelindeki tahmin edilen yanitlara (y sapkalara) karsilik ei lerin grafigi megafon seklinde ise artiklarin mutlak degerine karsilik y sapkalarin regresyon modeli kurulur.Bu kurulan modelden elde edilen tahmin degerlerini σi yerine kullaniriz.Agirliklari da wi= 1/σ2i seklinde olustururuz.
Aciklayici degiskene karsi ei kare grafigi artan seklindeyse ei karelere karsilik o aciklayici degiskenin regresyon modeli kurulur.Bu modelin tahminlerini σ2i yerine kullaniriz.Agirliklari da wi= 1/σ2i seklinde olustururuz.
Tahmin edilen yanitlara (y sapka) karsi ei kare grafigi artan seklindeyse ei karelere karsilik tahmin edilen yanitlara regresyon modeli kurulur. Bu modelden elde edilen tahminler σ2i tahminleri olarak kullanilir. Agirliklari da wi= 1/σ2i seklinde olustururuz.
Bunlardan hangisi daha uygun gorulurse agirlik o sekilde belirlenmelidir.
Simdi verimiz uzerinde bu anlatilanlari yapmaya baslayalim.
library(ggplot2)
veritrain$artik <-residuals(lmod) #EKK modelinden elde edilen artiklar (residuals)
veritrain$tahmin<-predict(lmod) #EKK modelinden elde edilen tahminler (prediction)
head(veritrain)
## car_ID symboling CarName fueltype aspiration doornumber
## 85 85 3 mitsubishi mirage g4 gas turbo two
## 198 198 -1 volvo 245 gas std four
## 6 6 2 audi fox gas std two
## 160 160 0 toyota corolla diesel std four
## 136 136 2 saab 99gle gas std four
## 17 17 0 bmw x5 gas std two
## carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 85 hatchback fwd front 95.9 173.2 66.3 50.2
## 198 wagon rwd front 104.3 188.8 67.2 57.5
## 6 sedan fwd front 99.8 177.3 66.3 53.1
## 160 hatchback fwd front 95.7 166.3 64.4 52.8
## 136 sedan fwd front 99.1 186.6 66.5 56.1
## 17 sedan rwd front 103.5 193.8 67.9 53.7
## curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 85 2926 ohc four 156 spdi 3.59 3.86
## 198 3042 ohc four 141 mpfi 3.78 3.15
## 6 2507 ohc five 136 mpfi 3.19 3.40
## 160 2275 ohc four 110 idi 3.27 3.35
## 136 2758 ohc four 121 mpfi 3.54 3.07
## 17 3380 ohc six 209 mpfi 3.62 3.39
## compressionratio horsepower peakrpm citympg highwaympg price artik
## 85 7.0 145 5000 19 24 14489 -391.0747
## 198 9.5 114 5400 24 28 16515 -380.0029
## 6 8.5 110 5500 19 25 15250 -731.0925
## 160 22.5 56 4500 38 47 7788 -2357.2862
## 136 9.3 110 5250 21 28 15510 1200.2336
## 17 8.0 182 5400 16 22 41315 13859.4870
## tahmin
## 85 14880.07
## 198 16895.00
## 6 15981.09
## 160 10145.29
## 136 14309.77
## 17 27455.51
EKK modelimizdeki elde edilen artiklar ve tahmin degerlerini verimize degisken olarak ekledik.
Simdi “Bazi olasi varyans ve standart sapma fonksiyonu tahminleri” nden birincisini inceleyelim.
op = par(bg = "papayawhip")
pairs(~artik+wheelbase+carlength+carwidth+carheight+curbweight+enginesize+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg+tahmin ,data=veritrain, main="Temel Dagilim Grafigi Matrisi")
Artiklar ile compressionratio sacinim grafigi megafon seklindedir. Bu bagimsiz degisken ile artiklarin mutlak degeri arasinda regresyon modeli kuralim.
model1<-lm(abs(veritrain$artik)~compressionratio,data=veritrain)
weights1<-1/(predict(model1))^2 # agirliklar wi= 1/σ2i
veritrain<-veritrain[, -c(27,28)]
weightedleastsquaremod1<-lm(price~wheelbase+carlength+carwidth+carheight+curbweight+enginesize+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg, data= veritrain, weights = weights1)
summary(weightedleastsquaremod1)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain,
## weights = weights1)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.4060 -0.8378 -0.0060 0.6556 6.4207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.046e+04 1.685e+04 -2.401 0.017496 *
## wheelbase 1.302e+02 1.127e+02 1.155 0.249807
## carlength -7.183e+01 6.268e+01 -1.146 0.253486
## carwidth 4.096e+02 2.706e+02 1.514 0.132009
## carheight 2.048e+02 1.477e+02 1.387 0.167395
## curbweight 8.208e-01 1.907e+00 0.431 0.667390
## enginesize 1.257e+02 1.585e+01 7.930 3.55e-13 ***
## boreratio -1.582e+03 1.425e+03 -1.111 0.268332
## stroke -3.461e+03 8.874e+02 -3.900 0.000141 ***
## compressionratio 3.846e+02 9.961e+01 3.861 0.000164 ***
## horsepower 2.713e+01 1.797e+01 1.510 0.133042
## peakrpm 2.279e+00 7.060e-01 3.228 0.001512 **
## citympg -3.753e+02 2.043e+02 -1.837 0.068021 .
## highwaympg 1.662e+02 1.802e+02 0.922 0.357869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.484 on 160 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8459
## F-statistic: 74.03 on 13 and 160 DF, p-value: < 2.2e-16
Summary kodumuza baktigimizda Residual standard error : 1.484 ve Adjusted R-squared : 0.8459 cikmistir.
Simdi “Bazi olasi varyans ve standart sapma fonkiyonu tahminleri” nden ikincisini inceleyelim;
2)Ilk basta kurulan EKK modelindeki tahmin edilen yanitlara (y sapkalara) karsilik ei lerin grafigi megafon seklinde ise artiklarin mutlak degerine karsilik y sapkalarin regresyon modeli kurulur.Bu kurulan modelden elde edilen tahmin degerlerini σi yerine kullaniriz.Agirliklari da wi= 1/σ2i seklinde olustururuz.
library(ggplot2)
op = par(bg = "lavender")
veritrain$artik<-(residuals(lmod)) #EKK modelinden elde edilen artiklar (residuals)
veritrain$tahmin<-predict(lmod) #EKK modelinden elde edilen tahminler (prediction)
head(veritrain)
## car_ID symboling CarName fueltype aspiration doornumber
## 85 85 3 mitsubishi mirage g4 gas turbo two
## 198 198 -1 volvo 245 gas std four
## 6 6 2 audi fox gas std two
## 160 160 0 toyota corolla diesel std four
## 136 136 2 saab 99gle gas std four
## 17 17 0 bmw x5 gas std two
## carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 85 hatchback fwd front 95.9 173.2 66.3 50.2
## 198 wagon rwd front 104.3 188.8 67.2 57.5
## 6 sedan fwd front 99.8 177.3 66.3 53.1
## 160 hatchback fwd front 95.7 166.3 64.4 52.8
## 136 sedan fwd front 99.1 186.6 66.5 56.1
## 17 sedan rwd front 103.5 193.8 67.9 53.7
## curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 85 2926 ohc four 156 spdi 3.59 3.86
## 198 3042 ohc four 141 mpfi 3.78 3.15
## 6 2507 ohc five 136 mpfi 3.19 3.40
## 160 2275 ohc four 110 idi 3.27 3.35
## 136 2758 ohc four 121 mpfi 3.54 3.07
## 17 3380 ohc six 209 mpfi 3.62 3.39
## compressionratio horsepower peakrpm citympg highwaympg price artik
## 85 7.0 145 5000 19 24 14489 -391.0747
## 198 9.5 114 5400 24 28 16515 -380.0029
## 6 8.5 110 5500 19 25 15250 -731.0925
## 160 22.5 56 4500 38 47 7788 -2357.2862
## 136 9.3 110 5250 21 28 15510 1200.2336
## 17 8.0 182 5400 16 22 41315 13859.4870
## tahmin
## 85 14880.07
## 198 16895.00
## 6 15981.09
## 160 10145.29
## 136 14309.77
## 17 27455.51
pairs(artik~tahmin,data=veritrain, main="Temel Dagilim Grafigi Matrisi")
model2<-lm(abs(veritrain$artik)~veritrain$tahmin)
weights2<-1/(predict(model2))^2 # agirliklar wi= 1/σ2i
veritrain<-veritrain[, -c(27,28)]
weightedleastsquaremod2<-lm(price~wheelbase+carlength+carwidth+carheight+curbweight+enginesize+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg, data= veritrain, weights = weights2)
summary(weightedleastsquaremod2)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain,
## weights = weights2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.2490 -0.8685 -0.2192 0.9702 3.9746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.335e+04 1.323e+04 -1.009 0.314359
## wheelbase 1.042e+02 8.951e+01 1.164 0.246184
## carlength -8.017e+00 3.654e+01 -0.219 0.826614
## carwidth -7.609e+01 2.120e+02 -0.359 0.720203
## carheight 9.306e+01 8.874e+01 1.049 0.295882
## curbweight 3.683e+00 1.164e+00 3.164 0.001862 **
## enginesize 3.243e+01 1.508e+01 2.150 0.033046 *
## boreratio -1.834e+03 1.244e+03 -1.475 0.142110
## stroke -2.703e+03 7.813e+02 -3.460 0.000693 ***
## compressionratio 2.280e+02 7.213e+01 3.161 0.001884 **
## horsepower 9.971e+01 1.560e+01 6.393 1.7e-09 ***
## peakrpm 6.720e-01 4.735e-01 1.419 0.157774
## citympg -1.534e+02 1.099e+02 -1.396 0.164721
## highwaympg 2.104e+02 1.012e+02 2.080 0.039110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.245 on 160 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.7825
## F-statistic: 48.87 on 13 and 160 DF, p-value: < 2.2e-16
Summary kodumuza baktigimizda Residual standard error : 1.245 ve Adjusted R-squared : 0.7825 cikmistir.
Simdi “Bazi olasi varyans ve standart sapma fonkiyonu tahminleri” nden ucuncusunu inceleyelim;
3)Aciklayici degiskene karsi ei kare grafigi artan seklindeyse ei karelere karsilik o aciklayici degiskenin regresyon modeli kurulur.Bu modelin tahminlerini σ2i yerine kullaniriz.Agirliklari da wi= 1/σ2i seklinde olustururuz.
Artiklarimizin kareleri;
library(ggplot2)
veritrain$artik<-(residuals(lmod)) #EKK modelinden elde edilen artiklar (residuals)
veritrain$tahmin<-predict(lmod) #EKK modelinden elde edilen tahminler (prediction)
head(veritrain)
## car_ID symboling CarName fueltype aspiration doornumber
## 85 85 3 mitsubishi mirage g4 gas turbo two
## 198 198 -1 volvo 245 gas std four
## 6 6 2 audi fox gas std two
## 160 160 0 toyota corolla diesel std four
## 136 136 2 saab 99gle gas std four
## 17 17 0 bmw x5 gas std two
## carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 85 hatchback fwd front 95.9 173.2 66.3 50.2
## 198 wagon rwd front 104.3 188.8 67.2 57.5
## 6 sedan fwd front 99.8 177.3 66.3 53.1
## 160 hatchback fwd front 95.7 166.3 64.4 52.8
## 136 sedan fwd front 99.1 186.6 66.5 56.1
## 17 sedan rwd front 103.5 193.8 67.9 53.7
## curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 85 2926 ohc four 156 spdi 3.59 3.86
## 198 3042 ohc four 141 mpfi 3.78 3.15
## 6 2507 ohc five 136 mpfi 3.19 3.40
## 160 2275 ohc four 110 idi 3.27 3.35
## 136 2758 ohc four 121 mpfi 3.54 3.07
## 17 3380 ohc six 209 mpfi 3.62 3.39
## compressionratio horsepower peakrpm citympg highwaympg price artik
## 85 7.0 145 5000 19 24 14489 -391.0747
## 198 9.5 114 5400 24 28 16515 -380.0029
## 6 8.5 110 5500 19 25 15250 -731.0925
## 160 22.5 56 4500 38 47 7788 -2357.2862
## 136 9.3 110 5250 21 28 15510 1200.2336
## 17 8.0 182 5400 16 22 41315 13859.4870
## tahmin
## 85 14880.07
## 198 16895.00
## 6 15981.09
## 160 10145.29
## 136 14309.77
## 17 27455.51
kareresid<-((veritrain$artik)^2)
Artiklarimizin karelerine karsilik gelen bagimsiz degiskenlerin grafigi;
op = par(bg = "lightgoldenrodyellow")
pairs(kareresid~wheelbase+carlength+carwidth+carheight+curbweight+enginesize+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg,data=veritrain, main="Temel Dagilim Grafigi Matrisi")
Aciklayici degiskene karsi ei kare grafigi artan seklinde oldugu icin ei karelere karsilik o aciklayici degiskenin regresyon modeli kurulur.
model3<-lm(kareresid~highwaympg,data=veritrain)
weights3<-1/(predict(model3))^2 # agirliklar wi= 1/σ2i
veritrain<-veritrain[, -c(27,28)]
weightedleastsquaremod3<-lm(price~wheelbase+carlength+carwidth+carheight+curbweight+enginesize+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg,data=veritrain, weights = weights3)
summary(weightedleastsquaremod3)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain,
## weights = weights3)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -6.765e-04 -1.873e-04 -2.366e-05 1.701e-04 8.327e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.434e+04 1.298e+04 -1.105 0.270809
## wheelbase 2.159e+01 6.214e+01 0.348 0.728661
## carlength 4.776e+01 2.044e+01 2.337 0.020681 *
## carwidth 1.671e+02 1.856e+02 0.900 0.369269
## carheight 8.916e+01 6.277e+01 1.421 0.157399
## curbweight 1.180e+00 1.127e+00 1.047 0.296549
## enginesize 7.262e+01 1.801e+01 4.032 8.52e-05 ***
## boreratio -4.311e+03 1.074e+03 -4.015 9.11e-05 ***
## stroke -2.805e+03 7.951e+02 -3.528 0.000546 ***
## compressionratio 3.662e+02 5.388e+01 6.796 2.01e-10 ***
## horsepower 7.426e+01 1.513e+01 4.907 2.26e-06 ***
## peakrpm 8.778e-01 4.041e-01 2.173 0.031288 *
## citympg -2.208e+02 6.539e+01 -3.377 0.000920 ***
## highwaympg 7.324e+01 9.576e+01 0.765 0.445526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002807 on 160 degrees of freedom
## Multiple R-squared: 0.8922, Adjusted R-squared: 0.8834
## F-statistic: 101.9 on 13 and 160 DF, p-value: < 2.2e-16
Summary kodumuza baktigimizda Residual standard error : 0.0002807 ve Adjusted R-squared : 0.8834 cikmistir.
Simdi “Bazi olasi varyans ve standart sapma fonkiyonu tahminleri” nden dorduncusunu inceleyelim;
Tahmin edilen yanitlara karsilik gelen hatalarin karelerinin grafigi;
library(ggplot2)
op = par(bg = "mintcream")
veritrain$artik<-(residuals(lmod)) #EKK modelinden elde edilen artiklar (residuals)
veritrain$tahmin<-predict(lmod) #EKK modelinden elde edilen tahminler (prediction)
head(veritrain)
## car_ID symboling CarName fueltype aspiration doornumber
## 85 85 3 mitsubishi mirage g4 gas turbo two
## 198 198 -1 volvo 245 gas std four
## 6 6 2 audi fox gas std two
## 160 160 0 toyota corolla diesel std four
## 136 136 2 saab 99gle gas std four
## 17 17 0 bmw x5 gas std two
## carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 85 hatchback fwd front 95.9 173.2 66.3 50.2
## 198 wagon rwd front 104.3 188.8 67.2 57.5
## 6 sedan fwd front 99.8 177.3 66.3 53.1
## 160 hatchback fwd front 95.7 166.3 64.4 52.8
## 136 sedan fwd front 99.1 186.6 66.5 56.1
## 17 sedan rwd front 103.5 193.8 67.9 53.7
## curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 85 2926 ohc four 156 spdi 3.59 3.86
## 198 3042 ohc four 141 mpfi 3.78 3.15
## 6 2507 ohc five 136 mpfi 3.19 3.40
## 160 2275 ohc four 110 idi 3.27 3.35
## 136 2758 ohc four 121 mpfi 3.54 3.07
## 17 3380 ohc six 209 mpfi 3.62 3.39
## compressionratio horsepower peakrpm citympg highwaympg price artik
## 85 7.0 145 5000 19 24 14489 -391.0747
## 198 9.5 114 5400 24 28 16515 -380.0029
## 6 8.5 110 5500 19 25 15250 -731.0925
## 160 22.5 56 4500 38 47 7788 -2357.2862
## 136 9.3 110 5250 21 28 15510 1200.2336
## 17 8.0 182 5400 16 22 41315 13859.4870
## tahmin
## 85 14880.07
## 198 16895.00
## 6 15981.09
## 160 10145.29
## 136 14309.77
## 17 27455.51
pairs(kareresid~veritrain$tahmin, main="Temel Dagilim Grafigi Matrisi")
Tahmin edilen yanitlara karsilik hatalarin karelerinin grafigi artan seklinde olmadi icin bu yontemi kullanamayiz.
Simdi bu yontemlerden elde edilen verilerin ozetine bakalim;
summary(lmod)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11683.9 -1816.6 -17.8 1412.3 13859.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.047e+04 1.686e+04 -2.401 0.017499 *
## wheelbase 1.302e+02 1.127e+02 1.155 0.249710
## carlength -7.222e+01 6.265e+01 -1.153 0.250699
## carwidth 4.112e+02 2.706e+02 1.519 0.130684
## carheight 2.048e+02 1.477e+02 1.387 0.167467
## curbweight 8.278e-01 1.907e+00 0.434 0.664829
## enginesize 1.255e+02 1.586e+01 7.915 3.88e-13 ***
## boreratio -1.585e+03 1.424e+03 -1.113 0.267413
## stroke -3.455e+03 8.880e+02 -3.891 0.000146 ***
## compressionratio 3.848e+02 9.918e+01 3.880 0.000152 ***
## horsepower 2.715e+01 1.797e+01 1.511 0.132730
## peakrpm 2.276e+00 7.063e-01 3.222 0.001544 **
## citympg -3.745e+02 2.041e+02 -1.835 0.068419 .
## highwaympg 1.648e+02 1.801e+02 0.915 0.361357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3206 on 160 degrees of freedom
## Multiple R-squared: 0.8576, Adjusted R-squared: 0.846
## F-statistic: 74.1 on 13 and 160 DF, p-value: < 2.2e-16
summary(weightedleastsquaremod1)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain,
## weights = weights1)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.4060 -0.8378 -0.0060 0.6556 6.4207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.046e+04 1.685e+04 -2.401 0.017496 *
## wheelbase 1.302e+02 1.127e+02 1.155 0.249807
## carlength -7.183e+01 6.268e+01 -1.146 0.253486
## carwidth 4.096e+02 2.706e+02 1.514 0.132009
## carheight 2.048e+02 1.477e+02 1.387 0.167395
## curbweight 8.208e-01 1.907e+00 0.431 0.667390
## enginesize 1.257e+02 1.585e+01 7.930 3.55e-13 ***
## boreratio -1.582e+03 1.425e+03 -1.111 0.268332
## stroke -3.461e+03 8.874e+02 -3.900 0.000141 ***
## compressionratio 3.846e+02 9.961e+01 3.861 0.000164 ***
## horsepower 2.713e+01 1.797e+01 1.510 0.133042
## peakrpm 2.279e+00 7.060e-01 3.228 0.001512 **
## citympg -3.753e+02 2.043e+02 -1.837 0.068021 .
## highwaympg 1.662e+02 1.802e+02 0.922 0.357869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.484 on 160 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8459
## F-statistic: 74.03 on 13 and 160 DF, p-value: < 2.2e-16
summary(weightedleastsquaremod2)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain,
## weights = weights2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.2490 -0.8685 -0.2192 0.9702 3.9746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.335e+04 1.323e+04 -1.009 0.314359
## wheelbase 1.042e+02 8.951e+01 1.164 0.246184
## carlength -8.017e+00 3.654e+01 -0.219 0.826614
## carwidth -7.609e+01 2.120e+02 -0.359 0.720203
## carheight 9.306e+01 8.874e+01 1.049 0.295882
## curbweight 3.683e+00 1.164e+00 3.164 0.001862 **
## enginesize 3.243e+01 1.508e+01 2.150 0.033046 *
## boreratio -1.834e+03 1.244e+03 -1.475 0.142110
## stroke -2.703e+03 7.813e+02 -3.460 0.000693 ***
## compressionratio 2.280e+02 7.213e+01 3.161 0.001884 **
## horsepower 9.971e+01 1.560e+01 6.393 1.7e-09 ***
## peakrpm 6.720e-01 4.735e-01 1.419 0.157774
## citympg -1.534e+02 1.099e+02 -1.396 0.164721
## highwaympg 2.104e+02 1.012e+02 2.080 0.039110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.245 on 160 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.7825
## F-statistic: 48.87 on 13 and 160 DF, p-value: < 2.2e-16
summary(weightedleastsquaremod3)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain,
## weights = weights3)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -6.765e-04 -1.873e-04 -2.366e-05 1.701e-04 8.327e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.434e+04 1.298e+04 -1.105 0.270809
## wheelbase 2.159e+01 6.214e+01 0.348 0.728661
## carlength 4.776e+01 2.044e+01 2.337 0.020681 *
## carwidth 1.671e+02 1.856e+02 0.900 0.369269
## carheight 8.916e+01 6.277e+01 1.421 0.157399
## curbweight 1.180e+00 1.127e+00 1.047 0.296549
## enginesize 7.262e+01 1.801e+01 4.032 8.52e-05 ***
## boreratio -4.311e+03 1.074e+03 -4.015 9.11e-05 ***
## stroke -2.805e+03 7.951e+02 -3.528 0.000546 ***
## compressionratio 3.662e+02 5.388e+01 6.796 2.01e-10 ***
## horsepower 7.426e+01 1.513e+01 4.907 2.26e-06 ***
## peakrpm 8.778e-01 4.041e-01 2.173 0.031288 *
## citympg -2.208e+02 6.539e+01 -3.377 0.000920 ***
## highwaympg 7.324e+01 9.576e+01 0.765 0.445526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002807 on 160 degrees of freedom
## Multiple R-squared: 0.8922, Adjusted R-squared: 0.8834
## F-statistic: 101.9 on 13 and 160 DF, p-value: < 2.2e-16
EKK modelimizin Residual standard error : 3206 ve Adjusted R-squared : 0.846 dir. 1. yontem ile agirliklandirma yaptigimizda Residual standard error : 1.484 ve Adjusted R-squared : 0.8459 dir. 2. yontem ile agirliklandirma yaptigimizda Residual standard error : 1.245 ve Adjusted R-squared : 0.7825 dir. 3. yontem ile agirliklandirma yaptigimizda Residual standard error : 0.0002807 ve Adjusted R-squared : 0.8834 dir.
Ayrica agirliklandirma yaptigimda kullanilan bagimsiz degiskenlerimizin katsayilarinda degisimler meydana gelmistir.
Simdi degisken varyanslilik probleminin giderildigini kontrol edelim.
par(mfrow=c(1,2),op = par(bg = "seashell"))
plot(predict(lmod),residuals(lmod))
plot(predict(weightedleastsquaremod2),residuals(weightedleastsquaremod2))
library(dplyr)
X<-model.matrix(lmod)
y<-veritrain$price #verimizdeki yanit degiskeni
W<-diag(weights2) #kosegenlerinde agirliklar olan matris
Z<-sqrt(W)%*%X #w matrisinin karekoku ile x in carpimi
yyildiz<-sqrt(W) %*% y
Bunlarin tanimlanmasi ile Agirliklandirilmis EKK modeli Basit EKK modeline donmustur.
Betawls<-solve(t(Z)%*%Z,t(Z)%*%yyildiz) #agirliklandirilmis ekk nin ß larini verir.
cbind(Betawls,coef(weightedleastsquaremod2))
## [,1] [,2]
## (Intercept) -1.335353e+04 -1.335353e+04
## wheelbase 1.041862e+02 1.041862e+02
## carlength -8.017161e+00 -8.017161e+00
## carwidth -7.608700e+01 -7.608700e+01
## carheight 9.306375e+01 9.306375e+01
## curbweight 3.683538e+00 3.683538e+00
## enginesize 3.242956e+01 3.242956e+01
## boreratio -1.834499e+03 -1.834499e+03
## stroke -2.702921e+03 -2.702921e+03
## compressionratio 2.279656e+02 2.279656e+02
## horsepower 9.970992e+01 9.970992e+01
## peakrpm 6.719929e-01 6.719929e-01
## citympg -1.534061e+02 -1.534061e+02
## highwaympg 2.104257e+02 2.104257e+02
donart<-yyildiz-Z%*%Betawls #donusturulmus model artiklari
head(cbind(donart,residuals(weightedleastsquaremod2)),15)
## [,1] [,2]
## 85 -0.30921405 -754.66900
## 198 0.20686925 576.60548
## 6 0.82862325 2179.31727
## 160 -1.35358069 -2200.82951
## 136 0.55325235 1295.98037
## 17 3.56377604 16408.85773
## 93 0.30303564 304.07710
## 125 -0.85841550 -2081.84692
## 41 0.38448240 545.59779
## 178 1.00980653 1443.67166
## 75 2.52285503 17073.73127
## 193 1.25566749 2230.36838
## 131 -0.14184893 -256.98656
## 50 -0.45388849 -3669.62884
## 115 -0.02284356 -71.89031
1.sutun donusturulmus artiklari , 2.sutun Agirliklandirilmis EKK modelinin artiklarini gostermektedir. Modelin artiklarina bakildiginda birbirinden ayri cikmistir.
lm modeli ile kurulan Agirliklandirilmis EKK modelinin artiklarini kok icinde w ile carparak donusturulmus modelin artiklari elde edilir.
head(cbind(sqrt(W)%*%residuals(weightedleastsquaremod2), donart),10)
## [,1] [,2]
## [1,] -0.3092141 -0.3092141
## [2,] 0.2068692 0.2068692
## [3,] 0.8286232 0.8286232
## [4,] -1.3535807 -1.3535807
## [5,] 0.5532524 0.5532524
## [6,] 3.5637760 3.5637760
## [7,] 0.3030356 0.3030356
## [8,] -0.8584155 -0.8584155
## [9,] 0.3844824 0.3844824
## [10,] 1.0098065 1.0098065
1.sutun donusturulmus artiklari , 2.sutun Agirliklandirilmis EKK modelinin artiklarini gostermektedir. Modelin artiklarina bakildiginda birbirleriyle ayni cikmistir.
Eger agirliklandirilmis ekk uzerinden residuals standart error hesaplarsak;
sqrt(sum(residuals(weightedleastsquaremod2)^2)/160)
## [1] 3852.062
Agirliklandirilmis ekk modelinin residuals standart erroru ile ayni cikmamistir.
Simdi donusturulmus artiklarin standart errorunu hesaplayalim;
sqrt(sum(donart^2)/160)
## [1] 1.245303
Simdi Agirliklandirilmis modelimiz icin BREUSCH-PAGAN TESTI yapalim;
BREUSCH-PAGAN TESTI
H0:Heterosce Dosticity (Degisken Varyanslilik) problemi yok. H1:Heterosce Dosticity (Degisken Varyanslilik) problemi vardir.
#install.packages("lmtest")
library(lmtest)
bpmod<-lm(donart^2~ wheelbase+carlength+carwidth+carheight+curbweight+enginesize+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg,data=veritrain)
summary(bpmod)
##
## Call:
## lm(formula = donart^2 ~ wheelbase + carlength + carwidth + carheight +
## curbweight + enginesize + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = veritrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6784 -0.9304 -0.3982 0.3440 13.6828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.1931776 10.5135321 1.255 0.211
## wheelbase 0.0319411 0.0703184 0.454 0.650
## carlength -0.0141919 0.0390737 -0.363 0.717
## carwidth -0.2615631 0.1688115 -1.549 0.123
## carheight -0.0273559 0.0921155 -0.297 0.767
## curbweight 0.0012101 0.0011896 1.017 0.311
## enginesize 0.0112707 0.0098923 1.139 0.256
## boreratio -0.1957785 0.8881160 -0.220 0.826
## stroke -0.6093111 0.5538594 -1.100 0.273
## compressionratio 0.0066170 0.0618587 0.107 0.915
## horsepower 0.0080652 0.0112076 0.720 0.473
## peakrpm 0.0003820 0.0004405 0.867 0.387
## citympg 0.0806786 0.1273206 0.634 0.527
## highwaympg -0.0176289 0.1123087 -0.157 0.875
##
## Residual standard error: 1.999 on 160 degrees of freedom
## Multiple R-squared: 0.1004, Adjusted R-squared: 0.02734
## F-statistic: 1.374 on 13 and 160 DF, p-value: 0.1771
BREUSCH-PAGAN Testimizin sonucuna gore p-value degerimiz 0.1771 cikmistir.P-value degerimiz 0.05’ten buyuk oldugundan H0 hipotezi kabul edilir yani degisken varyanslilik problemi yoktur deriz. Goruldugu uzere agirliklandirma yaparak degisken varyanslilik problemini ortadan kaldirmis olduk.
Hargreaves Lansdown sirketi ile Spread Co Sirketinin altin fiyat veri setidir.Bu veri setinde altin fiyatinin etkileyen degiskenler verilmistir.Verilen degiskenlere bakilarak en yuksek en dusuk gibi bagimsiz degiskenlerin toplam ciroyu etkileyip etkilemedigini kontrol edilir.
Altin Fiyat verimiz 12 Degiskenli ve 1660 gozlemlidir.
Degiskenler :
Total.Turnover:Altin fiyatinin toplam cirosu
Open:Acilis fiyati
High:Altin fiyatinin en yuksek noktasi
Low:Altin fiyatinin en dusuk noktasi
Close:Kapanis fiyati
WAP:Hacim agirlikli ortalama fiyat
No..of.Shares:Pay sayisi
No..of.Trades:Islem Sayisi
Deliverable.Quantity:Teslim edilebilir miktar
X..Deli..Qty.to.Traded.Qty:Islem miktari
Spread.H.L:Altin fiyatinin yayilimi(H.L Sirketi)
Spread.C.O:Altin fiyatinin yayilimi(C.O Sirketi)
knitr::include_graphics("fallinggoldprice.jpg")
library(readr)
library(dplyr)
golddata=read.csv("C:/Users/Melike/Desktop/BSE-BOM590111.csv", header=T)
GoldData <- golddata%>%select(c("Total.Turnover","Open","High","Low","Close","WAP","No..of.Shares","No..of.Trades","Deliverable.Quantity","X..Deli..Qty.to.Traded.Qty"))
head(GoldData,10)
## Total.Turnover Open High Low Close WAP No..of.Shares No..of.Trades
## 1 5848 0.79 0.79 0.76 0.76 0.79 7430 7
## 2 244 0.79 0.79 0.79 0.79 0.79 310 4
## 3 62 0.83 0.83 0.83 0.83 0.83 75 1
## 4 913 0.87 0.87 0.87 0.87 0.87 1050 2
## 5 364 0.91 0.91 0.91 0.91 0.91 400 1
## 6 3303 0.91 0.91 0.90 0.91 0.91 3632 6
## 7 50 0.91 0.91 0.91 0.91 0.91 55 1
## 8 68 0.93 0.93 0.92 0.92 0.92 74 2
## 9 68 0.89 0.89 0.89 0.89 0.88 77 2
## 10 73 0.88 0.88 0.85 0.85 0.87 84 2
## Deliverable.Quantity X..Deli..Qty.to.Traded.Qty
## 1 7430 100
## 2 310 100
## 3 75 100
## 4 1050 100
## 5 400 100
## 6 3632 100
## 7 55 100
## 8 74 100
## 9 77 100
## 10 84 100
summary(GoldData) #verimizi ozetleyelim
## Total.Turnover Open High Low
## Min. : 0 Min. : 0.500 Min. : 0.52 Min. : 0.50
## 1st Qu.: 7160 1st Qu.: 1.350 1st Qu.: 1.38 1st Qu.: 1.32
## Median : 108630 Median : 5.255 Median : 5.38 Median : 4.95
## Mean : 1664520 Mean :11.834 Mean :12.17 Mean :11.22
## 3rd Qu.: 2057775 3rd Qu.:11.393 3rd Qu.:11.94 3rd Qu.:11.25
## Max. :23830476 Max. :84.950 Max. :84.95 Max. :71.50
## Close WAP No..of.Shares No..of.Trades
## Min. : 0.520 Min. : 0.000 Min. : 1 Min. : 1.0
## 1st Qu.: 1.350 1st Qu.: 1.336 1st Qu.: 4393 1st Qu.: 14.0
## Median : 5.085 Median : 5.120 Median : 33871 Median : 72.5
## Mean :11.674 Mean :11.689 Mean :101330 Mean :123.0
## 3rd Qu.:11.377 3rd Qu.:11.452 3rd Qu.:153417 3rd Qu.:185.2
## Max. :74.550 Max. :74.972 Max. :849341 Max. :752.0
## Deliverable.Quantity X..Deli..Qty.to.Traded.Qty
## Min. : 1 Min. : 4.76
## 1st Qu.: 4300 1st Qu.: 78.93
## Median : 31095 Median :100.00
## Mean : 75755 Mean : 88.68
## 3rd Qu.:122723 3rd Qu.:100.00
## Max. :631381 Max. :100.00
Verimizi oncelikle test ve train olarak ikiye ayiralim.
set.seed(2)
index<-sample(1:nrow(GoldData),round(nrow(GoldData)*0.85))
veritrain<-GoldData[index,]
veritest<-GoldData[-index,]
Regresyon modelimizi yanit degiskenimize (Total.Turnover) karsilik bagimsiz degiskenlerimizi (Open,High,Low,Close, WAP,No..of.Shares,No..of.Trades, Deliverable.Quantity,X..Deli..Qty.to.Traded.Qty) kullanarak kuralim.
lmod1<- lm(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty,data=veritrain)
summary(lmod1)
##
## Call:
## lm(formula = Total.Turnover ~ Open + High + Low + Close + WAP +
## No..of.Shares + No..of.Trades + Deliverable.Quantity + X..Deli..Qty.to.Traded.Qty,
## data = veritrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6158078 -487930 57706 454578 13366606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.350e+06 3.113e+05 -10.761 < 2e-16 ***
## Open 3.341e+04 4.769e+04 0.701 0.48369
## High 4.174e+04 6.611e+04 0.631 0.52794
## Low 8.208e+04 6.267e+04 1.310 0.19051
## Close 2.483e+05 8.619e+04 2.881 0.00403 **
## WAP -2.912e+05 1.146e+05 -2.542 0.01112 *
## No..of.Shares 1.103e+01 1.210e+00 9.112 < 2e-16 ***
## No..of.Trades 4.616e+03 3.785e+02 12.195 < 2e-16 ***
## Deliverable.Quantity -4.595e+00 1.579e+00 -2.910 0.00368 **
## X..Deli..Qty.to.Traded.Qty 2.650e+04 3.195e+03 8.294 2.55e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1196000 on 1401 degrees of freedom
## Multiple R-squared: 0.8315, Adjusted R-squared: 0.8304
## F-statistic: 768.2 on 9 and 1401 DF, p-value: < 2.2e-16
Kurulan regresyon modelinin anlamliligina baktigimizda p value yaklasik 0 cikmistir.P-value degerimiz 0.05’ten kucuk oldugundan HO red edilir yani kurulan model anlamlidir deriz.
Iki degisken arasi lineer iliskiyi gosterir.Eger bir aciklayici degisken ve bir diger aciklayici degiskenin veya degiskenlerin lineer bir kombinasyonlari ise bu durumda x transpoz x matrisi (X’X) singuler olur ve tersi alinamaz. Bu durumdan ilgili degiskenlerden biri modelden cikartilarak kurtulunur. Gozlem sayisi arttikca ic iliski durumu azalir.
Simdi Collinearity teshisi icin korelasyon matrisi, kosul indeksi ve vif e bakacagiz.
Simdi x in korelasyon matrisine bakalim. Bunun icin yanit degiskenini (Total.Turnover) veriden cikartmaliyiz.Geri kalanlarin korelasyon matrisine bakmaliyiz.
cor(veritrain[,-c(1)])
## Open High Low Close
## Open 1.0000000 0.9986117 0.9976764 0.9968616
## High 0.9986117 1.0000000 0.9976920 0.9985662
## Low 0.9976764 0.9976920 1.0000000 0.9988147
## Close 0.9968616 0.9985662 0.9988147 1.0000000
## WAP 0.9975843 0.9989584 0.9991246 0.9997287
## No..of.Shares 0.2038230 0.2028714 0.2027559 0.2040643
## No..of.Trades 0.6368979 0.6372438 0.6368978 0.6392304
## Deliverable.Quantity 0.2338235 0.2321991 0.2342762 0.2348234
## X..Deli..Qty.to.Traded.Qty -0.3522097 -0.3555572 -0.3467139 -0.3503717
## WAP No..of.Shares No..of.Trades
## Open 0.9975843 0.2038230 0.6368979
## High 0.9989584 0.2028714 0.6372438
## Low 0.9991246 0.2027559 0.6368978
## Close 0.9997287 0.2040643 0.6392304
## WAP 1.0000000 0.2025176 0.6382485
## No..of.Shares 0.2025176 1.0000000 0.5791851
## No..of.Trades 0.6382485 0.5791851 1.0000000
## Deliverable.Quantity 0.2332093 0.9641106 0.5739868
## X..Deli..Qty.to.Traded.Qty -0.3500533 -0.6084525 -0.5023000
## Deliverable.Quantity X..Deli..Qty.to.Traded.Qty
## Open 0.2338235 -0.3522097
## High 0.2321991 -0.3555572
## Low 0.2342762 -0.3467139
## Close 0.2348234 -0.3503717
## WAP 0.2332093 -0.3500533
## No..of.Shares 0.9641106 -0.6084525
## No..of.Trades 0.5739868 -0.5023000
## Deliverable.Quantity 1.0000000 -0.4786068
## X..Deli..Qty.to.Traded.Qty -0.4786068 1.0000000
Korelasyon matrisine baktigimizda ornegin en yuksek Open ile High (aralarindaki korelasyon 0.9986117) bagimsiz degiskenlerinin iliskili oldugu gorulmektedir. Diger bagimsiz degiskenler arasinda da korelasyon oldukca yuksektir.
Bu korelasyona simdi korelasyon plotu ile bakalim.
library(corrplot)
corrplot(cor(veritrain[,-c(1)]),method = "pie", order="hclust")
Korelasyon plotunda Pozitif korelasyonlar mavi, negatif korelasyonlar kirmizi renkte gosterilir.
Korelasyon plotu ile korelasyon matrisimizin sonuclari ayni cikmistir.Bagimsiz degiskenlerin arasinda pozitif yonlu iliskinin yuksek oldugu gorulmektedir.(Mavinin tonuna gore en yuksek iliskiden en dusuk iliskiye gore koyu maviden aciga dogru gidiyor.)
Kappa degeri > 30 ise orta derece collinearity , Kappa degeri > 100 ise guclu collinearity oldugunu gosterir.
Kurulan regresyon modelimizi matrix haline donusturelim
x <- model.matrix(lmod1)[,-1] #yanit degiskeni Total.Turnover i cikardik
head(x,10)
## Open High Low Close WAP No..of.Shares No..of.Trades
## 975 5.84 5.84 5.84 5.84 5.839986 29360 10
## 710 1.75 1.75 1.74 1.74 1.739952 10102 5
## 774 2.02 2.02 2.02 2.02 2.019567 971 141
## 416 1.60 1.60 1.60 1.60 1.600000 20000 2
## 392 1.23 1.34 1.23 1.32 1.295786 14000 25
## 273 1.21 1.27 1.21 1.27 1.259369 47550 57
## 1373 19.50 20.75 19.50 20.35 20.311586 116645 89
## 1228 10.12 10.75 10.12 10.72 10.678524 7534 24
## 1405 26.35 26.35 26.35 26.35 26.349952 17871 20
## 1321 13.12 13.90 13.12 13.79 13.713660 31302 68
## Deliverable.Quantity X..Deli..Qty.to.Traded.Qty
## 975 29360 100.00
## 710 10101 99.99
## 774 971 100.00
## 416 20000 100.00
## 392 14000 100.00
## 273 47550 100.00
## 1373 106327 91.15
## 1228 7534 100.00
## 1405 17871 100.00
## 1321 29858 95.39
Verimizden yanit degiskenini cikartip sadece aciklayici degiskenlerden olusan matrix formuna donusturuyoruz.
Simdi x transpoz x in eigen valuelerini hesaplayalim
e <- eigen(t(x)%*%x)$values
e
## [1] 6.048582e+13 6.467282e+11 2.309077e+07 6.439792e+06 1.054083e+06
## [6] 1.597005e+03 8.092111e+02 1.508346e+02 7.566758e+01
Kappa degeri : sqrt(en buyuk ozdeger / en kucuk ozdeger)
k <- sqrt(max(e)/min(e))
k
## [1] 894070.7
Sonucumuzda Kappa = 894070.7 > 30 oldugundan sonucumuza gore collinearity (ic iliski) problemi vardir.
Xi degiskenlerinin diger bagimsiz degiskenler ile regresyonundan elde edilen R kare degerlerinin yuksekligi Collinearity (ic iliski) nin varligini gosterir. Buna bagli gelistirilmis olcut VIF dir.
VIF degerinin 10 dan buyuk olmasi collinearity (ic iliski) probleminin oldugunu soyler.
Hazir kod ile vif degerlerine bakmak icin car paketi kullanilir.
library(car)
vif(lmod1)
## Open High
## 658.423881 1344.316877
## Low Close
## 1009.537156 2077.201819
## WAP No..of.Shares
## 3704.247966 27.402551
## No..of.Trades Deliverable.Quantity
## 2.643136 22.128530
## X..Deli..Qty.to.Traded.Qty
## 2.707769
Bagimsiz degiskenlerimiz icin hesaplatilan vif degerlerimiz 10 dan buyuk oldugundan bagimsiz degiskenler arasinda collinearity(ic iliski) problemi vardir deriz.
Bu uc tanimlama yontemi de bize bu veride collinearity problemi oldugunu isaret etmektedir.
Ridge regresyon EKK optimizasyon yontemine bir kisit getirir. Bu kisit ß katsayilarinin karelerinin toplaminin uzerinedir.
Lambda buyudukce ß parametreleri 0 a dogru yaklasir.Parametrelerin buyumesi parametre tahminlerinin varyansini dusurur. Negatif etkisi modele yanlilik katmasidir.
Var(ßsapka ridge)= σ^2 / (x’x + λI) burada λ yi buyuk tutarsak varyans kuculur ayni zamanda yanlilik artar. Ikisi arasinda denge kuracak sekilde λ belirlenmelidir. Multicollinearity problemini halledebilecek en kucuk λ yi belirlemeliyiz.
EKK tahmin edicisi yansiz bir tahmin edici iken Ridge Regresyonunun tahmin edicisi yanli bir tahmin ediciye donusur.Ic iliski durumlarinda varyanslar buyukken Ridge Regresyonunda varyanslar daha kucuktur.
Ridge Regresyon da aciklayici degiskenlerin tamamini modele dahil eder ve kompleks model olusmasini saglar. (Modele degisken ekledikce hata duser fakat kompleks hale de gelebilir.)
lambda<- 10^seq(-15, 9, length.out = 200) #lamda icin dizi olusturma
Ridge’de lambda parametresinin secimi icin en iyi yontem Cross Validationdur.Bu sebeple lambdalar icin oncelikle bir dizi olusturduk.
x<-as.matrix(veritrain[,-1])
head(x,10)
## Open High Low Close WAP No..of.Shares No..of.Trades
## 975 5.84 5.84 5.84 5.84 5.839986 29360 10
## 710 1.75 1.75 1.74 1.74 1.739952 10102 5
## 774 2.02 2.02 2.02 2.02 2.019567 971 141
## 416 1.60 1.60 1.60 1.60 1.600000 20000 2
## 392 1.23 1.34 1.23 1.32 1.295786 14000 25
## 273 1.21 1.27 1.21 1.27 1.259369 47550 57
## 1373 19.50 20.75 19.50 20.35 20.311586 116645 89
## 1228 10.12 10.75 10.12 10.72 10.678524 7534 24
## 1405 26.35 26.35 26.35 26.35 26.349952 17871 20
## 1321 13.12 13.90 13.12 13.79 13.713660 31302 68
## Deliverable.Quantity X..Deli..Qty.to.Traded.Qty
## 975 29360 100.00
## 710 10101 99.99
## 774 971 100.00
## 416 20000 100.00
## 392 14000 100.00
## 273 47550 100.00
## 1373 106327 91.15
## 1228 7534 100.00
## 1405 17871 100.00
## 1321 29858 95.39
Train verimizi matrix haline donusturuyoruz (yanit degiskenini (Total.Turnover) cikardik)
Lambdanin farkli degerleri icin degiskenlerin aldigi farkli degerlerin grafigini cizdirelim; Ridge Regresyonda alpha = 0 degerini alir.
library(glmnet)
op = par(bg = "snow2")
ridgemodel<-glmnet(x,veritrain$Total.Turnover,alpha = 0,lambda = lambda)
plot(ridgemodel,xvar = "lambda")
Bu grafik lambdanin farkli degerleri icin degiskenlerin aldigi farkli degerlerin grafigidir.
Grafikte cizgilere baktigimizda cok buyuk degisim gosteren stabil olmayan degiskenler vardir.Bu degiskenin degeri digerlerine gore degisim gosteriyor. Bu multicollinearity nin bir etkisidir.
Grafigimizin ust bolumunde gozuken 9 degerleri Ridge Regresyonun hicbir bagimsiz degiskeni atmayip tamamini kullanmasindandir.
Grafikte gorulen her bir cizgi bir degiskene karsilik gelmektedir.
Bu grafik ile model katsayilarinin lambdaya gore nasil degistigini gosterdik. Simdi Cross Validation Yontemi ile optimal lambdayi belirleyelim.
Cross Validationda 9 fold ile model kurulur 10 uncu fold da bu modelin performansi incelenir MSE’ye bakilir.Herbir lambda degeri icin Cross Validation yapilir.Tum bulunan MSE ortalamalari alinir ve minimum RMSE degerini veren lambda degeri secilir.
op = par(bg = "mintcream")
cv.fitridge<-cv.glmnet(x,veritrain$Total.Turnover,alpha=0,lambda = lambda) #yukarida olusturulan lambda dizisi icin
plot(cv.fitridge)
Grafikte gozuken kirmizi noktalar her lambda degeri icin 10 folddan gelen MSE’lerin otalamasidir.
Grafikteki ilk dogru minimum MSE degerini veren lambda degerinin logaritmasini, ikinci dogru ise foldlardan elde edilen MSE degerlerinin standart sapmasinin 1 oldugu lambda degerinin logaritmasini gostermektedir.
Grafigimizin ust bolumunde gozuken 9 degerleri Ridge Regresyonunun tum degiskenleri kullanmasidir.
Simdi optimal lambda degerini kullanarak Ridge Regresyon modelimizi kuralim ve bu modelin test verisi uzerinde RMSE ve R kare degerlerini hesaplayalim.
Performans kiyaslamasi her zaman test veri seti uzerinden yapilir.Cunku multicollinearity nin train e bir etkisi yoktur ama teste etkisi vardir.
optimumlambda<-cv.fitridge$lambda.min #optimum lambda icin bunlar icerisinden min olan secilir.
optimumlambda
## [1] 14992.68
lambda_1SE<-cv.fitridge$lambda.1se
lambda_1SE
## [1] 1683180
Grafigimizde cikan cizgilerimizin yerine bakarsak ;
Ilk Dogru: log(λmin)=log(14992.68)= 9.615317
Ikinci Dogru : log(λ1SE)=log(1683180)= 14.3362
Ridge Regresyon modeli;
ridgemodel<-glmnet(x,veritrain$Total.Turnover,alpha=0,lambda=optimumlambda)
RMSE ve R kare hesaplayan fonksiyonlar;
rmse<-function(true, predicted,n) {sqrt(sum((predicted - true)^2)/n)}
ypredictedridge <- predict(ridgemodel, s = optimumlambda, newx = as.matrix(veritest[,-1]))# kurulan model uzerinden elde edilen tahmin degerleri
rsquare <- function(true, predicted) {
sse <- sum((predicted - true)^2)
sst <- sum((true - mean(true))^2)
rsq <- 1 - sse / sst
rsq }
Test verisi uzerinden Ridge modelinin R karesini hesaplatalim;
ridgerkare<-rsquare(veritest$Total.Turnover,ypredictedridge)
ridgerkare
## [1] 0.8349589
Ridge Regresyon icin test verisi uzerinden R kare 0.8349589 cikmistir.
Test verisi uzerinden Ridge modelinin RMSE sini hesaplatalim;
ridgermse<-rmse(veritest$Total.Turnover,ypredictedridge,length(veritest$Total.Turnover))
ridgermse
## [1] 1269120
Ridge Regresyon icin test verisi uzerinden RMSE 1269120 cikmistir.
AIC ve BIC performans degerlendirme kriterleridir. Modeller arasinda kiyas yaparken kullanilir. Genel olarak AIC ve BIC degerleri daha kucuk olan model diger modellere gore daha iyidir. AIC ve BICde artik degerler kullanilir.
Test verisi icin artiklar;
ridgeartik<-veritest$Total.Turnover-(ypredictedridge)
head(ridgeartik,10)
## 1
## 2 598822.3
## 5 598841.1
## 16 606202.5
## 30 602460.3
## 40 591278.7
## 51 615221.6
## 63 591965.4
## 66 590474.8
## 69 631728.2
## 72 458314.5
ridgeaic<-nrow(GoldData)*(log(2*pi)+1+log((sum((ridgeartik)^2)/nrow(GoldData))))+((length(ridgemodel$Total.Turnover)+1)*2)
ridgeaic
## [1] 48222.39
Ridge Regresyon icin AIC degeri 48222.39 cikmistir.
ridgebic<-nrow(GoldData)*(log(2*pi)+1+log((sum((ridgeartik)^2)/nrow(GoldData))))+((length(ridgemodel$Total.Turnover)+1)*log(nrow(GoldData)))
ridgebic
## [1] 48227.8
Ridge regresyon icin BIC degeri 48227.8 cikmistir.
Degisken secimi icin kullanilir. Lassoda katsayilarin bazilari ridgeden farkli olarak sifirlanmaktadir.
Lasso multicollinearity problemini cozerken ayni zamanda degisken secimi de yapabilme yetenegine sahiptir.Lassoda da λ ceza parametresi vardir. λ buyudukce modelden atilan (disarida birakilan) degisken sayimiz artiyor.
Bazen Lasso cok fazla degiskeni disarida birakmaktadir. Bu modelin tahmin performansini dusurur. Modelde gormek istedigimiz degiskeni goremeyebiliriz.
Ilk olarak Cross Validation ile Optimal Lambda degerini belirleyelim. Lasso Regresyonunda alpha = 1 degerini alir.
op = par(bg = "lavender")
cv.fitlasso<-cv.glmnet(x,veritrain$Total.Turnover,alpha=1,lambda = lambda) #lambda dizisinde belirledigimiz lamdalar kullanilir.
plot(cv.fitlasso)
Grafigimizin ust bolumunde gozuken degerler Lasso Regresyonunun bagimsiz degiskenlerinin tamamini kullanmayarak modelden atmasindan dolayi kaynaklanir.
Grafikte gozuken kirmizi noktalar her lambda degeri icin 10 folddan gelen MSE’lerin ortalamasidir.
Grafikteki ilk dogru minimum MSE degerini veren lambda degerinin logaritmasini, ikinci dogru ise foldlardan elde edilen MSE degerlerinin standart sapmasinin 1 oldugu lambda degerinin logaritmasini gostermektedir.
Optimal Lambda degeri;
optimallambda<-cv.fitlasso$lambda.min
optimallambda
## [1] 1231.551
lambda_1SE<-cv.fitlasso$lambda.1se
lambda_1SE
## [1] 240940.4
Grafigimizde cikan cizgilerimizin yerine bakarsak ;
Ilk Dogru: log(λmin)=log(1231.551)= 7.11603
Ikinci Dogru : log(λ1SE)=log(240940.4)= 12.3923
Simdi Optimal Lambda degerini kullanarak Lasso Regresyon modelimizi kuralim ve bu modelin test verisi uzerinde RMSE ve R kare degerlerini hesaplayalim.
Lasso Regresyon modeli;
lassomodel<-glmnet(x,veritrain$Total.Turnover,alpha=1,lambda=optimallambda)
coef(lassomodel)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -3.262259e+06
## Open 1.932924e+04
## High 2.109307e+04
## Low 4.099178e+04
## Close 3.057947e+04
## WAP .
## No..of.Shares 1.055970e+01
## No..of.Trades 4.604948e+03
## Deliverable.Quantity -3.901827e+00
## X..Deli..Qty.to.Traded.Qty 2.564554e+04
9 degiskenli bir modeldi.Lasso Regresyonu WAP degiskenini atarak 8 degisken ile calismistir.Bu degiskene iliskin katsayilar sifirlanmistir.
Simdi test verisi uzerinden Lasso Regresyon modelimizin performansina bakalim.
Kurulan model uzerinden elde edilen tahmin degerleri;
ypredictedlasso <- predict(lassomodel, s = optimallambda, newx = as.matrix(veritest[,-1]))
head(ypredictedlasso,10)
## 1
## 2 -588746.9
## 5 -588523.3
## 16 -596053.7
## 30 -592206.5
## 40 -580584.1
## 51 -605166.3
## 63 -580916.6
## 66 -580332.7
## 69 -621878.8
## 72 -439280.2
Test verisi icin Lasso Regresyon modelinin R karesi;
lassorkare<-rsquare(veritest$Total.Turnover,ypredictedlasso)
lassorkare
## [1] 0.8356641
Test verisi icin Lasso Regresyon Modelinin R karesi 0.8356641 cikmistir.
Test verisi icin Lasso Regresyon Modelinin RMSE si;
lassormse<-rmse(veritest$Total.Turnover,ypredictedlasso,length(veritest$Total.Turnover))
lassormse
## [1] 1266406
Test verisi icin Lasso Regresyon Modelinin RMSE si 1266406 cikmistir.
Lasso Regresyon Modeli icin artiklar;
lassoartik<-veritest$Total.Turnover-(ypredictedlasso)
head(lassoartik,10)
## 1
## 2 588990.9
## 5 588887.3
## 16 596147.7
## 30 592219.5
## 40 581546.1
## 51 605233.3
## 63 583016.6
## 66 581682.7
## 69 621941.8
## 72 455131.2
Lassonun artiklari uzerinden AIC;
lassoaic<-nrow(GoldData)*(log(2*pi)+1+log((sum((lassoartik)^2)/nrow(GoldData))))+((length(lassomodel$Total.Turnover)+1)*2)
lassoaic
## [1] 48215.28
Lasso Regresyon icin AIC degeri 48215.28 cikmistir.
Lassonun artiklari uzerinden BIC;
lassobic<-nrow(GoldData)*(log(2*pi)+1+log((sum((lassoartik)^2)/nrow(GoldData))))+((length(lassomodel$Total.Turnover)+1)*log(nrow(GoldData)))
lassobic
## [1] 48220.69
Lasso regresyon icin BIC degeri 48220.69 cikmistir.
Elatic net Ridge Regresyon Modeli ile Lasso Regresyon Modelinin bir kombinasyonudur.Ridge tarzi cezalandirma ve Lasso tarzi degisken secimi yapar. Lasso ve Ridge’de bulunan λ parametresi haricinde ikinci bir parametre olan α parametresi de vardir. Ozellikle yuksek korelasyonlu degisken gruplari oldugunda onerilir.
Ilk olarak Cross Validation ile Optimal Lamda degerini belirleyelim. λ= 0.5 alindiginda Elastic Net Regresyon Modeli elde edilir.
op = par(bg = "lemonchiffon")
cv.fitelasticnet<-cv.glmnet(x,veritrain$Total.Turnover,alpha=0.5,lambda = lambda)
plot(cv.fitelasticnet)
Grafigimizin ust bolumunde gozuken degerler Elastic Net Regresyonunun bagimsiz degiskenlerinin tamamini kullanmayarak modelden atmasindan dolayi kaynaklanir.
Grafikte gozuken kirmizi noktalar her lambda degeri icin 10 folddan gelen MSE’lerin ortalamasidir.
Grafikteki ilk dogru minimum MSE degerini veren lambda degerinin logaritmasini, ikinci dogru ise foldlardan elde edilen MSE degerlerinin standart sapmasinin 1 oldugu lambda degerinin logaritmasini gostermektedir.
Optimal Lambda degeri;
optlambda<-cv.fitelasticnet$lambda.min
optlambda
## [1] 2833.096
lambda_1SE<-cv.fitelasticnet$lambda.1se
lambda_1SE
## [1] 419870.7
Grafigimizde cikan cizgilerimizin yerine bakarsak ;
Ilk Dogru: log(λmin)=log(2833.096)= 7.949125
Ikinci Dogru : log(λ1SE)=log(419870.7)= 12.9477
Simdi Optimal Lambda degerini kullanarak Elastic Net Regresyon Modelimizi kuralim ve bu modelin test verisi uzerinde RMSE ve R kare degerlerini hesaplayalim.
elasticmodel<-glmnet(x,veritrain$Total.Turnover,alpha=0.5,lambda=optlambda)
coef(elasticmodel)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -3.218563e+06
## Open 1.889485e+04
## High 2.371953e+04
## Low 4.023749e+04
## Close 2.878726e+04
## WAP .
## No..of.Shares 1.030544e+01
## No..of.Trades 4.607224e+03
## Deliverable.Quantity -3.572329e+00
## X..Deli..Qty.to.Traded.Qty 2.518785e+04
9 degiskenli modelimizde Elastic Net Regresyonu WAP degiskenini atarak 8 degisken ile calismistir.Bu degiskene iliskin katsayilar sifirlanmistir.
Simdi test verisi uzerinden Elastic Net Regresyon Modelimizin performansina bakalim.
Kurulan model uzerinden elde edilen tahmin degerleri;
ypredictedelasticnet <- predict(elasticmodel, s = optlambda, newx = as.matrix(veritest[,-1]))
head(ypredictedelasticnet,10)
## 1
## 2 -591066.9
## 5 -590885.9
## 16 -598420.4
## 30 -594594.5
## 40 -582890.6
## 51 -607506.4
## 63 -583002.3
## 66 -582423.3
## 69 -624164.8
## 72 -439729.3
Test verisi icin Elastic Net Regresyon Modelinin R karesi;
elasticrkare<-rsquare(veritest$Total.Turnover,ypredictedelasticnet)
elasticrkare
## [1] 0.8357445
Test verisi icin Elastic Net Regresyon Modelinin R karesi 0.8357445 cikmistir.
Test verisi icin Elastic Net Regresyon Modelinin RMSE si;
elasticrmse<-rmse(veritest$Total.Turnover,ypredictedelasticnet,length(veritest$Total.Turnover))
elasticrmse
## [1] 1266096
Test verisi icin Elastic Net Regresyon Modelinin RMSE si 1266096 cikmistir.
Elastic Net Regresyon Modeli icin artiklar;
elasticartik<-veritest$Total.Turnover-(ypredictedelasticnet)
head(elasticartik,10)
## 1
## 2 591310.9
## 5 591249.9
## 16 598514.4
## 30 594607.5
## 40 583852.6
## 51 607573.4
## 63 585102.3
## 66 583773.3
## 69 624227.8
## 72 455580.3
Elastic Net in artiklari uzerinden AIC;
elasticaic<-nrow(GoldData)*(log(2*pi)+1+log((sum((elasticartik)^2)/nrow(GoldData))))+((length(elasticmodel$Total.Turnover)+1)*2)
elasticaic
## [1] 48214.47
Elastic Net Regresyon icin AIC degeri 48214.47 cikmistir.
Elastic Net in artiklari uzerinden BIC;
elasticbic<-nrow(GoldData)*(log(2*pi)+1+log((sum((elasticartik)^2)/nrow(GoldData))))+((length(elasticmodel$Total.Turnover)+1)*log(nrow(GoldData)))
elasticbic
## [1] 48219.88
Elastic Net Regresyon icin BIC degeri 48219.88 cikmistir.
Simdi kurulan modellerde AIC, BIC, RMSE ve R kare degerlerini karsilastirip model secimi yapalim.
tablo<-matrix(c(ridgeaic,ridgebic,ridgermse,ridgerkare,
lassoaic,lassobic,lassormse,lassorkare,
elasticaic,elasticbic,elasticrmse,elasticrkare),3,4,byrow = TRUE)
row.names(tablo)<-c("Ridge","Lasso","Elasticnet")
colnames(tablo)<-c("AIC","BIC","RMSE","Rkare")
tablo
## AIC BIC RMSE Rkare
## Ridge 48222.39 48227.80 1269120 0.8349589
## Lasso 48215.28 48220.69 1266406 0.8356641
## Elasticnet 48214.47 48219.88 1266096 0.8357445
Secim yaparken AIC BIC ve RMSE degerleri kucuk , R kare degeri buyuk olan model secilmelidir.
Modellerin R2 degerlerini, RMSE degerlerini, AIC, BIC degerlerini karsilastirdigimizda;
En kucuk rmse degeri Elasticnet Regresyon Modelinde(1266096), En kucuk AIC degeri Elasticnet Regresyon Modelinde(48214.47), En kucuk BIC degeri Elasticnet Regresyon Modelinde(48219.88), En buyuk R kare degeri Elasticnet Regresyon Modelinde(0.8357445) cikmistir. Bu sebeple calisabilecegimiz en iyi regresyon modeli Elastic Net Regresyon Modelidir. Bu veriyi modellemede Elastic Net Regresyon Modeli daha uygundur.
Multicollinearity nin cozumunde Ridge,Lasso,Elastic Net haricinde Temel Bilesenler regresyonu da kullanilir. Simdi multicollinearity probleminden Temel Bilesenler Yolu ile kurtulmayi deneyelim.
Veri setindeki tum degiskenler vektoru ifade eder. Degiskenler arasinda iliski olmadiginda bu vektorler birbirlerine diktir.Temel bilesenler analizinin amaci, X matrisine bir donusum uygulayarak birbirlerine dik (ortogonal) vektorlerden olusan bir sistem elde etmektir. Yuksek boyutlu verilerde dusuk boyutlu dogrusal yapi elde etmek icin kullanilan bir yontemdir. Vektor boyutlari kisa olanlar goz ardi edilir.
set.seed(2)
index<-sample(1:nrow(GoldData),round(nrow(GoldData)*0.85))
veritrain<-GoldData[index,]
veritest<-GoldData[-index,]
Oncelikle Train veri seti uzerinde kurdugumuz EKK Regresyon Modelini kullanarak, hem train hem de test veri seti uzerinde RMSE degerlerini hesaplayalim.
EKK Regresyon Modelimiz;
lmod1 <- lm(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty,data=veritrain)
summary(lmod1)
##
## Call:
## lm(formula = Total.Turnover ~ Open + High + Low + Close + WAP +
## No..of.Shares + No..of.Trades + Deliverable.Quantity + X..Deli..Qty.to.Traded.Qty,
## data = veritrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6158078 -487930 57706 454578 13366606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.350e+06 3.113e+05 -10.761 < 2e-16 ***
## Open 3.341e+04 4.769e+04 0.701 0.48369
## High 4.174e+04 6.611e+04 0.631 0.52794
## Low 8.208e+04 6.267e+04 1.310 0.19051
## Close 2.483e+05 8.619e+04 2.881 0.00403 **
## WAP -2.912e+05 1.146e+05 -2.542 0.01112 *
## No..of.Shares 1.103e+01 1.210e+00 9.112 < 2e-16 ***
## No..of.Trades 4.616e+03 3.785e+02 12.195 < 2e-16 ***
## Deliverable.Quantity -4.595e+00 1.579e+00 -2.910 0.00368 **
## X..Deli..Qty.to.Traded.Qty 2.650e+04 3.195e+03 8.294 2.55e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1196000 on 1401 degrees of freedom
## Multiple R-squared: 0.8315, Adjusted R-squared: 0.8304
## F-statistic: 768.2 on 9 and 1401 DF, p-value: < 2.2e-16
Kurulan regresyon modelinin anlamliligina baktigimizda p value degerimiz yaklasik=0 cikmistir.P-value degerimiz 0.05’ten kucuk oldugundan H0 red edilir yani kurulan model anlamlidir deriz.
Simdi train ve test veri seti uzerinde RMSE degerlerini hesaplayalim.
rmse <- function(x,y) sqrt(mean((x-y)^2))
rmse(predict(lmod1), veritrain$Total.Turnover)
## [1] 1191567
Train veri seti uzerinden kurulan regresyon modelinin RMSE degeri 1191567 dir.
rmse(predict(lmod1, veritest), veritest$Total.Turnover)
## [1] 1268034
Test veri seti uzerinden kurulan regresyon modelinin RMSE degeri 1268034 dir.
Iki rmse degeri arasinda fark olmasinin sebebi multicollinearity nin varliginin gostergesidir.
par(mfrow=c(1,3),op = par(bg = "ivory1"))
plot(High~Open, veritrain)
plot(Low~High, veritrain)
plot(WAP~Close, veritrain)
Bazi aciklayici degiskenler arasinda sacinim grafigi cizdirdik. Bunlar arasinda lineer iliski goruluyor. Bu da multicollinearity’nin varliginin bir gostergesidir.
Simdi Temel Bilesenler Regresyonunu yapalim.
Simdi tum componentleri kullanarak bir Temel Bilesenler Regresyon Modeli kuralim.
library(pls)
pcrmodel <- pcr(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty,data=veritrain,scale=T)
summary(pcrmodel)
## Data: X dimension: 1411 9
## Y dimension: 1411 1
## Fit method: svdpc
## Number of components considered: 9
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 65.67 89.82 96.07 99.70 99.93 99.97 99.99
## Total.Turnover 77.65 79.98 82.38 82.52 83.05 83.05 83.05
## 8 comps 9 comps
## X 100.00 100.00
## Total.Turnover 83.06 83.15
Ciktinin ilk satiri componentlerin X matrisindeki aciklayicilik oranlarini, ikinci satir ise yanit degiskenindeki aciklayicilik miktarlarini gostermektedir.
Verimizde 9 adet vektor vardi.Dokuzuncu component ile yanittaki degiskenligin yaklasik %83.15 i aciklanabilmektedir.
Train ne kadar cok componentle calisirsa o kadar iyi sonuc alinir. Cunku multicollinearity nin train uzerinde etkisi yoktur fakat test verisi uzerinde etkisi vardir.
op = par(bg = "seashell")
validationplot(pcrmodel,val.type = "RMSE",col="green")
Bu grafik bize her componente karsilik gelen RMSE degerlerini gosterir. Train veri seti uzerinden cizilen grafikte component sayisi arttikca RMSE degerleri dusmektedir.
En buyuk degisim intercepten birinci componente geciste yasanmistir.Ucuncu componentten sonra bir degisim olmamistir.Toplamda intercept ile birlikte 9 component var.
Bu grafige iliskin RMSE degerleri;
pcrmse <- RMSEP(pcrmodel)
pcrmse
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## 2902924 1372297 1299012 1218556 1213791 1195298
## 6 comps 7 comps 8 comps 9 comps
## 1195267 1195173 1194796 1191567
RMSE degeri component sayisi arttikca azalmaktadir. En dusuk RMSE degeri ilk olarak birinci componentte gorulmektedir.
Simdi RMSE icin yaptiklarimizi R kare icinde yapalim.
op = par(bg = "thistle1")
validationplot(pcrmodel,val.type = "R2",col="orangered3")
Bu grafik bize her componente karsilik gelen R kare degerlerini gosterir. Train veri seti uzerinden cizilen grafikte component sayisi arttikca R kare degerleri artmalidir.
En buyuk degisim sifirinci intercepten birinci componente geciste yasanmistir.Ucuncu componentten sonra bir degisim olmamistir.Toplamda intercept ile birlikte 9 component vardir.
Multicollinearity icin train veri setine bakmak uygun olmadigindan test veri setine bakmak daha uygun olacaktir.
Test seti uzerinde component sayilarina gore RMSE degerlerini de RMSEP fonksiyonunu kullanarak bulabiliriz.
pcrmse <- RMSEP(pcrmodel,newdata=veritest)
pcrmse
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## 3129122 1389048 1377986 1277119 1289373 1266618
## 6 comps 7 comps 8 comps 9 comps
## 1267766 1268591 1279188 1268034
Test veri seti uzerinden baktigimizda performans acisindan en kucuk RMSE degeri 1266618 ile besinci componenttedir.
Simdi 5 component ile kurulan modelin katsayilarina bakalim.
coef(pcrmodel,ncomp=5)
## , , 5 comps
##
## Total.Turnover
## Open 388365.5
## High 375486.6
## Low 374607.6
## Close 374205.6
## WAP 373758.2
## No..of.Shares 1523753.3
## No..of.Trades 631089.6
## Deliverable.Quantity -431002.0
## X..Deli..Qty.to.Traded.Qty 438473.4
Bu cikti bize pcr 5 component icin modelin yanit degiskeni tahminlerini verir.
Simdi bu modeli kullanarak train ve test veri seti uzerinden RMSE degerlerini hesaplayalim.
rmse(predict(pcrmodel, ncomp=5), veritrain$Total.Turnover)
## [1] 1195298
Train veri seti uzerinden RMSE degeri 1195298 dir.
rmse(predict(pcrmodel, veritest, ncomp=5), veritest$Total.Turnover)
## [1] 1266618
Test veri seti uzerinden RMSE degeri 1266618 dir.
Tahmin performansimiz hala cok iyi degildir ama parametre tahminlerimiz daha stabil cikmistir.Optimal component sayisini belirlerken test verisi uzerindeki performansa baktik. Bu is icin Cross Validation yonteminin kullanilmasi aslinda daha dogru bir yaklasim olacaktir.
Kac component olmasi gerektigini daha iyi verecek yontemdir Diger yontemlerde her yapista farkliliklar olmaktadir.
Train veri seti uzerinden Cross Validation hesaplatalim;
set.seed(2)
pcrmodel1 <- pcr(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty,data=veritrain,scale=T,validation="CV")
pcrCV <- RMSEP(pcrmodel1, estimate="CV")
pcrCV
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## 2904983 1375744 1305215 1225696 1222286 1204988
## 6 comps 7 comps 8 comps 9 comps
## 1209815 1213585 1219082 1217670
RMSE degeri en dusuk besinci componentte(1204988) cikmistir.
op = par(bg = "snow")
plot(pcrCV,main="",col="red")
Cross Validation icin componentlere karsilik RMSE degerlerinin grafigi yukarida gosterilmektedir.
pcrcv icin en kucuk RMSE degerini alan componentin hangisi olduguna bakmak icin;
which.min(pcrCV$val)
## [1] 6
Grafikte Cross Validated RMSE degerleri vardir.10 fold cross validation ile pcrCV degerlerinin altincisi yani besinci component’e karsilik gelen RMSE degeri minimum cikti. (intercept+5 component)
coef(pcrmodel1,ncomp=5)
## , , 5 comps
##
## Total.Turnover
## Open 388365.5
## High 375486.6
## Low 374607.6
## Close 374205.6
## WAP 373758.2
## No..of.Shares 1523753.3
## No..of.Trades 631089.6
## Deliverable.Quantity -431002.0
## X..Deli..Qty.to.Traded.Qty 438473.4
Bu cikti bize pcrCV 5 component icin modelin yanit degiskeni tahminlerini verir. Yani tahmin performansimiz besinci componentte en iyi olmaktadir.
Datayi golddata alarak modelimizi kuralim.
model1 <- lm(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty,data=golddata)
summary(model1)
##
## Call:
## lm(formula = Total.Turnover ~ Open + High + Low + Close + WAP +
## No..of.Shares + No..of.Trades + Deliverable.Quantity + X..Deli..Qty.to.Traded.Qty,
## data = golddata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6354408 -488882 82256 441551 13353426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.436e+06 2.822e+05 -12.177 < 2e-16 ***
## Open -1.561e+04 4.190e+04 -0.373 0.70946
## High 1.438e+05 5.449e+04 2.640 0.00838 **
## Low 1.592e+05 5.671e+04 2.807 0.00506 **
## Close 1.665e+05 8.109e+04 2.053 0.04020 *
## WAP -3.365e+05 1.065e+05 -3.159 0.00161 **
## No..of.Shares 1.111e+01 1.111e+00 10.001 < 2e-16 ***
## No..of.Trades 4.307e+03 3.426e+02 12.572 < 2e-16 ***
## Deliverable.Quantity -4.674e+00 1.460e+00 -3.203 0.00139 **
## X..Deli..Qty.to.Traded.Qty 2.747e+04 2.898e+03 9.479 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1204000 on 1650 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8321
## F-statistic: 914.7 on 9 and 1650 DF, p-value: < 2.2e-16
Regresyon modelimizin ciktisina baktigimizda p value degerimiz yaklasik 0 cikmistir.P-value degerimiz 0.05’ten kucuk oldugu icin kurulan modelimiz anlamlidir.
fit<- fitted(model1)
head(fit,10) #tahmin degerleri
## 1 2 3 4 5 6 7 8
## -528051.3 -577041.0 -586779.4 -571499.3 -575295.2 -534543.6 -577516.4 -570630.7
## 9 10
## -572050.5 -582949.8
resid <-residuals(model1)
head(resid,10) #artiklar
## 1 2 3 4 5 6 7 8
## 533899.3 577285.0 586841.4 572412.3 575659.2 537846.6 577566.4 570698.7
## 9 10
## 572118.5 583022.8
Tum normallik testlerini (Shapiro-Wilk,Kolmogorov-Smirnov,Cramer-von Mises,Anderson-Darling) bir arada gormek icin asagidaki kodu kullanmaliyiz;
library(olsrr)
ols_test_normality(model1)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.7734 0.0000
## Kolmogorov-Smirnov 0.225 0.0000
## Cramer-von Mises 140.2225 0.0000
## Anderson-Darling 86.7059 0.0000
## -----------------------------------------------
Normallik testlerimizin p-valuelerine baktigimizda 0.05’ten kucuk oldugunu goruyoruz yani artiklarimizin dagilimi normal degildir deriz.
op = par(bg = "mintcream")
x <-residuals(model1)
#Artiklarin histogrami;
histogram <-hist(x, breaks=10, density=10,col="darkgrey",xlab="Residuals", main="Histogram")
abline(v=mean(x), col="darkgreen", lwd=2)
#Yogunluk egrisi cizme;
multiplier <- histogram$counts / histogram$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity,col="blue", lwd=2)
#Normal egrisisin ayni ortalama ve standart sapma ile cizilmesi;
xfit <- seq (min(x), max(x), length=40)
yfit <- dnorm(xfit, mean =mean(x), sd = sd(x))
yfit <- yfit *diff(histogram$mids[1:2]) *length(x)
lines(xfit, yfit, col="red", lwd=2)
Histogram grafigimize baktigimizda kirmizi cizgi bize normal dagilim egrisini verir.Mavi olan cizgi ise sinif orta noktalarindan gecen diyagramdir.Mavi ve kirmizi cizgilerimiz birbirine benzemedigi icin verinin normal dagilmadigini soyleyebiliriz.
#QQ-plot;
op = par(bg = "mistyrose")
qqnorm(residuals(model1),ylab="residuals",main="Q-Q PLOT",col="green")
qqline(residuals(model1),col="pink")
Q-Q Plot grafigimize baktigimizda verimiz Q-Q cizgisi uzerinde olmadigi icin normal dagilim varsayiminin saglanmadigi gorulmektedir.
#Density;
op = par(bg = "thistle1")
d <- density(x)
plot(d,main = "Yogunluk Grafigi")
polygon(d, col="red", border="violet")
library(plotly)
p <- ggplot(golddata, aes(x)) +
geom_histogram(aes(y = ..density..), alpha = 0.7,bins = 60, fill = "#FF00FF") +
geom_density(fill = "#FFFF66", alpha = 0.5) +
theme(panel.background = element_rect(fill = '#99FFFF')) +
ggtitle("Density with Histogram overlay")
fig <- ggplotly(p)
fig
#Interaktif density plotun gorseli;(Html ciktisini pdf e cevirince gozukmedigi icin ekledik)
knitr::include_graphics("yogunlukgrafigi.png")
Yogunluk Grafigimize baktigimizda sag kuyruk daha uzun oldugu icin grafigimiz sola carpiktir deriz.
Sabit varyansliligin en kullanisli teshis yontemi artiklara (residuals(lmod)) karsilik tahmin (fitted(lmod)) degerlerinin plotlanmasidir.
library(ggplot2)
ggplot(data=golddata,mapping=aes(x=fit,y=resid))+
geom_jitter(color="purple")+
geom_hline(yintercept=0,color="orange")+ggtitle("RESID & FITTED ")+xlab(" FITTED ")+ylab("RESID")
#Interaktif bir sekilde gorsellestirelim ;
p <- plot_ly(veritrain, x = fit, y = resid, alpha = 0.3)
subplot(
add_markers(p, size = 2, name = "default"),
add_markers(p, size = 2, sizes = c(1, 205), name = "custom"))
#Interaktif plotun gorseli;(Html ciktisini pdf e cevirince gozukmedigi icin ekledik)
knitr::include_graphics("varsayimkontrolu2.png")
Cizdirdigimiz grafigimiz bize duzgun bir sekil vermedigi icin sabit varyansli olup olmadigina emin olamiyoruz. Bu yuzden degisken varyanslilik testlerine bakmaliyiz.
H0:Heterosce Dosticity (Degisken Varyanslilik) problemi yok. H1:Heterosce Dosticity (Degisken Varyanslilik) problemi vardir.
#install.packages("lmtest")
library(lmtest)
bptest(model1,data=golddata)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 313.32, df = 9, p-value < 2.2e-16
BREUSCH-PAGAN Testimizin sonucuna gore p-value degerimiz yaklasik 0 cikmistir.P-Value degerimiz 0.05’ten kucuk oldugu icin H0 hipotezi reddedilir yani Heteroscedosticity (degisken varyanslilik) problemi vardir deriz.
H0: Otokorelasyon yoktur. H1: Otokorelasyon vardir.
library(car)
library(lmtest)
dwtest(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty ,data=golddata)
##
## Durbin-Watson test
##
## data: Total.Turnover ~ Open + High + Low + Close + WAP + No..of.Shares + No..of.Trades + Deliverable.Quantity + X..Deli..Qty.to.Traded.Qty
## DW = 0.87314, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
Sonucumuza gore p-value degerimiz yaklasik 0 cikmistir.P-value degerimiz 0.05’ten kucuk oldugu icin HO reddedilir.Bu nedenle Otokorelasyon vardir deriz.
Model tarafindan iyi tahmin edilemeyen gozlemlere denir.(Hatalara buyuk olan gozlemlere denir.)
Simdi kurulan regresyon modelimizin grafiklerini inceleyelim;
Oncelikle modelimizdeki supheli outlier icin varsayimlarimiza bakalim;
ols_plot_resid_stud_fit(model1)
Cizdirdigimiz grafikte kirmizi olan gozlemler supheli outlier degerlerimizdir.Bu grafik sadece varsayim yapmaktadir.Bu sebeple oncelikle library(faraway) ile plot cizdirerek modelimizdeki outlier suphesi olan gozlemlere bakalim.
library(faraway)
op = par(bg = "mintcream")
plot(model1)
1.Grafik icin; Artiklara karsi cizdirdigimiz grafige baktigimizda varyanslarin homojen olmadigi gorulmektedir.Outlier suphesi olan gozlemlerimiz 1454,1458,1461.gozlem cikmistir. 2.Grafik icin; Normal Q-Q Plot Grafigimize baktigimizda verimiz Q-Q Plot cizgisi uzerinde olmadigindan normal dagilim varsayiminin saglanmadigi gorulmektedir.Outlier suphesi olan gozlemlerimiz 1454,1458,1461.gozlem cikmistir. 3.Grafik icin; Standartlastirilmis artiklarin karekokune karsi cizdirdigimiz grafige baktigimizda varyanslarin homojen olmadigi gorulmektedir.Outlier suphesi olan gozlemlerimiz 1454,1458,1461.gozlem cikmistir. 4.Grafik icin; Cooks Distance’a gore grafikte cikan degerler : 1413,1419,1423.gozlemlere outlier suphesi ile yaklasilir
Bu grafiklere baktigimizda 1413,1419,1423,1454,1458,1461. gozlemler muhtemelen sorunlu olarak tanimlayabiliriz.
Verimizde bu gozlemlerin hangi durumlari temsil ettigini gormek icin bu gozlemlere bakmaliyiz;
golddata[c(1413,1419,1423,1454,1458,1461), ]
## Date Open High Low Close WAP No..of.Shares No..of.Trades
## 1413 2010-11-25 58.90 67.50 53.50 65.45 60.78086 249813 387
## 1419 2010-11-16 80.00 80.00 64.00 64.60 66.11543 259030 229
## 1423 2010-11-10 72.00 84.00 67.45 68.75 69.67889 243074 442
## 1454 2010-09-28 55.10 56.90 51.05 55.15 54.78093 326230 474
## 1458 2010-09-22 54.80 55.65 53.25 54.50 54.59983 436457 509
## 1461 2010-09-17 59.95 59.95 50.00 53.75 54.08437 357340 582
## Total.Turnover Deliverable.Quantity X..Deli..Qty.to.Traded.Qty Spread.H.L
## 1413 15183849 179080 71.69 14.00
## 1419 17125879 170574 65.85 16.00
## 1423 16937126 146724 60.36 16.55
## 1454 17871182 218875 67.09 5.85
## 1458 23830476 336060 77.00 2.40
## 1461 19326507 240191 67.22 9.95
## Spread.C.O
## 1413 6.55
## 1419 -15.40
## 1423 -3.25
## 1454 0.05
## 1458 -0.30
## 1461 -6.20
Modelimiz icin standartlastirilmis artiklarimizi bakacak olursak;
stud <- rstudent(model1)
head(stud,10)
## 1 2 3 4 5 6 7 8
## 0.4435703 0.4796336 0.4875780 0.4755846 0.4782851 0.4468542 0.4798706 0.4741627
## 9 10
## 0.4753420 0.4844058
Simdi standartlastirilmis artiklarimizin mutlakca en buyugune bakmaliyiz.
stud[which.max(abs(stud))]
## 1458
## 11.59116
En buyuk standartlastirilmis artik (rstudent) degerimiz 1458. gozlem olup degeri mutlakca 11.59116 dir.Fakat aykiri gozlem midir?
Benferonni duzelltmesi yaparsak; Simdi cut point belirlemeliyiz.
cut point degeri alfa/2n dir. rstudentler (n-p-1) serbestlik dereceli t-dagilimina sahiptir. (p:degisken sayisi +1 , n : gozlem sayisi)
qt(0.05/(length(stud)*2) , (length(golddata$Total.Turnover)-10-1)) #(alfa/2n),(n-p-1)
## [1] -4.184228
Burada en buyuk standartlastirilmis mutlak artik degerini 11.59116 olarak bulmustuk. |11.59116| >|-4.184228| oldugundan 1458. gozlem bizim icin outlierdir.
Bunu hazir kod ile yaptigimizda;
library(car)
outlierTest(model1)
## rstudent unadjusted p-value Bonferroni p
## 1458 11.591158 6.3747e-30 1.0582e-26
## 1461 7.872113 6.2675e-15 1.0404e-11
## 1454 7.316053 3.9691e-13 6.5888e-10
## 1464 7.016401 3.3119e-12 5.4977e-09
## 1400 6.097889 1.3360e-09 2.2177e-06
## 1419 6.041481 1.8836e-09 3.1268e-06
## 1659 -5.401962 7.5537e-08 1.2539e-04
## 1444 5.247341 1.7430e-07 2.8935e-04
## 1428 5.239411 1.8183e-07 3.0184e-04
## 1425 4.980403 7.0105e-07 1.1637e-03
En buyuk aykiri deger 1458 gozleme ait olmakla beraber diger outlier degerler ise 1461,1454,1464,1400,1419,1659,1444,1428,1425. degerlere aittir. Veri setimizde aykiri gozlemler mevcuttur.
Varsayimlar gerceklesmediginde (hatalar normal dagilmadiginda)ve aykiri gozlemler oldugunda ROBUST REGRESYON yontemi kullanilabilir.
Tum regresyon varsayimlari gecerli oldugunda , dogrusal regresyon icin normal EKK tahminleri en uygunudur.Bu varsayimlardan bazilari gecersiz oldugunda , EKK regresyonu kotu performans gosterebilir.
Aykiri gozlemler regresyon dogrusunu kendine gore kendine dogru ceker.Bu sekilde parametre tahminlerini olmasi gerektigi yerden cok uzaga tasiyabilir.Model uzerinde etkileri diger gozlemlerden daha fazla olur.Bu durum bizim model tahmin performansimizi dusurur.Bu durumda robust regresyon kullanilir.
Bu gozlemlerin model uzerinde etkisini dusunerek agirliklandirma yapiyoruz.
Birden cok aykiri gozlem varsa modele bundan etkilenebiliyor(maskeleme-swamping).Bu sekilde kurulan modelde yanlis olmus oluyor ve bu modelin artiklari uzerinden yorum yapmak da cok dogru olmuyor.
Aykiri gozlem calismasi yapilacaksa Robust Regresyon Modeli kurup bu regresyon modelinin artiklari uzerinden konusmak daha dogru olacaktir.
Robust Regresyon iteratif yeniden agirlikli En Kucuk Kareler (IRLS) ile yapilir.Robust Regresyon calistirma komutu MASS paketinde rlm’dir.IRLS icin kullanilabilecek cesitli agirlik fonksiyonlari vardir.
Siradan EKK regresyonu ve robust regresyonun sonuclarini karsilastirirken sonuclar cok farkliysa Robust Regresyondan gelen sonuclar kullanilir.Buyuk farkliliklar, model parametrelerinin aykiri degerlerden buyuk oranda etkilendigini gostermektedir.Farkli agirliklandirmalarin avantajlari ve dezavantajlari vardir.Huber agirliklari siddetli aykiri degerlerde zorluklar yasayabilir ve bisquare agirliklar yakinsamada zorluk yasayabilir veya birden fazla cozum verebilir.
Oncelikle Robust Regresyon modelimizi kuralim;
library(MASS)
hubermod <- rlm(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty ,data=golddata)
summary(hubermod)
##
## Call: rlm(formula = Total.Turnover ~ Open + High + Low + Close + WAP +
## No..of.Shares + No..of.Trades + Deliverable.Quantity + X..Deli..Qty.to.Traded.Qty,
## data = golddata)
## Residuals:
## Min 1Q Median 3Q Max
## -6606316 -279479 65794 253552 14696066
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -2205824.1561 106581.9606 -20.6960
## Open 5231.7244 15826.3601 0.3306
## High -26970.4520 20581.3416 -1.3104
## Low 97661.6766 21418.2200 4.5597
## Close 47138.0707 30627.0802 1.5391
## WAP -5954.0549 40236.0077 -0.1480
## No..of.Shares 10.3472 0.4196 24.6568
## No..of.Trades 1911.3249 129.3939 14.7714
## Deliverable.Quantity -5.2374 0.5513 -9.5006
## X..Deli..Qty.to.Traded.Qty 17937.1101 1094.6072 16.3868
##
## Residual standard error: 384400 on 1650 degrees of freedom
Huber agirliklari kullanarak kurdugumuz Robust Regresyon Modelimizin Residual Standard Error u 384400 cikmistir.
EKK modelimizde Residual Standard Error’umuz 1204000 cikmisti.
Iki regresyon modelinin katsayilarini yan yana gosterelim;
cbind(coef(model1),coef(hubermod))
## [,1] [,2]
## (Intercept) -3.436162e+06 -2.205824e+06
## Open -1.561467e+04 5.231724e+03
## High 1.438381e+05 -2.697045e+04
## Low 1.591807e+05 9.766168e+04
## Close 1.664999e+05 4.713807e+04
## WAP -3.365059e+05 -5.954055e+03
## No..of.Shares 1.111237e+01 1.034717e+01
## No..of.Trades 4.307138e+03 1.911325e+03
## Deliverable.Quantity -4.674345e+00 -5.237446e+00
## X..Deli..Qty.to.Traded.Qty 2.747152e+04 1.793711e+04
Iki modelin ciktilari arasinda gozle gorulur degisimler vardir.
Simdi Robust Regresyon icin standart artiklari inceleyelim;
op = par(bg = "lavender")
halfnorm(residuals(hubermod),4,ylab = "hubermod residuals")
Robust Regresyonumuzun ham artiklari 1454,1458,1461,1464. gozlemler olarak cikmistir.
stud <- rstudent(hubermod) #robust regresyonunun standart artiklari
stud[which.max(abs(stud))]
## 1458
## 11.94331
1458.gozlem (11.94331) en buyuk standartlastirilmis artik degeridir. Fakat aykiri gozlem midir?
Bonferonni duzeltmesi yaparsak;
#alfa 0.05
qt(.05/(length(stud)*2),length(golddata$Total.Turnover)-10-1) #p=bagimsiz degisken sayisi+1
## [1] -4.184228
1458.gozlem ( |11.94331 | > |-4.184228| ) bonferonni duzeltmesine gore outlierdir.
Diger outlier degerler;
outlierTest(hubermod)
## rstudent unadjusted p-value Bonferroni p
## 1458 11.943307 1.3566e-31 2.2519e-28
## 1461 8.775316 4.1694e-18 6.9213e-15
## 1454 7.721862 1.9752e-14 3.2788e-11
## 1464 7.361245 2.8623e-13 4.7515e-10
## 1400 7.202854 8.9301e-13 1.4824e-09
## 1419 7.005260 3.5782e-12 5.9398e-09
## 1423 6.427678 1.6918e-10 2.8085e-07
## 1413 5.822225 6.9655e-09 1.1563e-05
## 1444 5.781962 8.8140e-09 1.4631e-05
## 1425 5.651794 1.8674e-08 3.0999e-05
En buyuk outlier deger 1458.gozleme ait olmakla birlikte diger outlier degerler ise 1461,1454,1464,1400,1419,1423,1413,1444,1425 degerlere aittir.Swamping: Outlier yuzunden aslinda outlier olmayan bir gozlemin outliermis gibi gozukmesidir.1659,1428.gozlemlerim swamping olmustur.Masking :Bir outlierin baska bir outlieri maskelemesidir.1413 ve 1423.gozlemlerim masking olmustur.
Simdi de bisquare agirliklandirmasini kullanarak regresyon modelimizi kuralim;
bisquaremod <- rlm(Total.Turnover~Open+
High+
Low+
Close+
WAP+
No..of.Shares+
No..of.Trades+
Deliverable.Quantity+
X..Deli..Qty.to.Traded.Qty ,data=golddata,psi=psi.bisquare)
summary(bisquaremod)
##
## Call: rlm(formula = Total.Turnover ~ Open + High + Low + Close + WAP +
## No..of.Shares + No..of.Trades + Deliverable.Quantity + X..Deli..Qty.to.Traded.Qty,
## data = golddata, psi = psi.bisquare)
## Residuals:
## Min 1Q Median 3Q Max
## -5765065 -152324 13584 110566 16175436
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -1351783.8184 48860.6212 -27.6661
## Open 176375.1390 7255.3158 24.3098
## High -176042.7801 9435.1533 -18.6582
## Low -165684.4742 9818.8054 -16.8742
## Close -311219.4830 14040.4450 -22.1659
## WAP 576059.1375 18445.4885 31.2304
## No..of.Shares 10.8263 0.1924 56.2755
## No..of.Trades 46.6348 59.3183 0.7862
## Deliverable.Quantity -6.4677 0.2527 -25.5920
## X..Deli..Qty.to.Traded.Qty 11613.6728 501.8034 23.1439
##
## Residual standard error: 181600 on 1650 degrees of freedom
Bisquare agirliklari kullanarak kurdugumuz Robust Regresyon Modelimizin Residual Standard Error u 181600 cikmistir. Huber agirliklari kullanarak kurdugumuz Robust Regresyon Modelimizin Residual Standard Error u 384400 cikmisti. EKK modelimizde Residual Standard Error’umuz 1204000 cikmisti.
Bisquare agirliklara kullanarak kurdugumuz Robust Regresyon Modelimizin Residual Standard Erroru daha iyi cikmistir.
Simdi tekrardan agirliklara bakalim;
biweights <- data.frame(state= golddata$Date, resid = bisquaremod$resid, weight = bisquaremod$w)
biweights2 <- biweights[order(bisquaremod$w), ]
biweights2[120:150,]
## state resid weight
## 1408 2010-12-02 -2562021 0
## 1409 2010-12-01 -3152464 0
## 1410 2010-11-30 -3425590 0
## 1411 2010-11-29 -1145606 0
## 1413 2010-11-25 9853137 0
## 1414 2010-11-24 1200350 0
## 1416 2010-11-22 1733448 0
## 1417 2010-11-19 3657511 0
## 1418 2010-11-18 2809126 0
## 1419 2010-11-16 8596703 0
## 1420 2010-11-15 7352802 0
## 1421 2010-11-12 5942499 0
## 1422 2010-11-11 5578841 0
## 1423 2010-11-10 10405847 0
## 1424 2010-11-09 3992821 0
## 1425 2010-11-08 8976603 0
## 1426 2010-11-05 -1974580 0
## 1428 2010-11-03 7052236 0
## 1429 2010-11-02 3736821 0
## 1430 2010-11-01 2540430 0
## 1432 2010-10-28 2514228 0
## 1433 2010-10-27 4208106 0
## 1434 2010-10-26 3570175 0
## 1435 2010-10-25 2487132 0
## 1436 2010-10-22 7288598 0
## 1437 2010-10-21 4606050 0
## 1438 2010-10-20 5882975 0
## 1439 2010-10-19 5314953 0
## 1440 2010-10-18 2909048 0
## 1441 2010-10-15 3962451 0
## 1442 2010-10-14 7601450 0
Ilk sutun Robust Regresyon Modelinin artiklarini gosterir. Resid ler ne kadar buyukse ona karsilik gelen weight degeri de o kadar kucuk olmaktadir.
Iki modelin residual standart error’lerine bakildigi zaman Bisquare yontemi daha kucuk residual standart error degerine sahiptir.